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Abstract 

Nowadays,  compilers  are  looking  for  more  optimizing  opportunities  by  performing  aggres¬ 
sive  code  transformations  which  introduce  high  register  pressure.  High  register  pressure 
potentially  increases  register  overhead  operations.  Register  allocation  must  deal  with  high 
register  pressure  well  so  that  the  performance  gain  of  the  code  transformations  is  not  thrown 
away  by  the  increased  overhead  operations. 

Register  allocation  must  deal  with  three  issues;  spilling,  live-range  splitting,  and  register 
assignment.  These  issues  are  closely  related  to  each  other.  Focusing  on  one  issue  and 
ignoring  the  others  may  deteriorate  the  quality  of  register  allocation.  This  dissertation 
proposes  a  novel  register-allocation  approach  that  is  fusion-based.  Fusion-style  register 
allocation  starts  off  with  constructing  regions  and  applies  graph  fusion  along  control-flow 
edges  to  combine  the  interference  graphs  of  regions  into  the  interference  graph  for  the  whole 
function.  Graph  fusion  integrates  spilling,  splitting,  and  register  assignment  in  a  seamless 
fashion.  Fusion-based  register  allocation  is  sensitive  to  the  ordering  of  control-flow  edges 
that  connect  regions.  Splitting  is  unlikely  to  happen  at  high  priority  edges  (frequently 
executed)  and  likely  to  happen  at  low  priority  edges  (infrequently  executed). 

This  dissertation  presents  a  register  allocation  framework  that  allows  us  to  model  various 
register  allocation  approaches,  e.g.,  Chaitin-style,  optimistic,  priority-based,  and  fusion- 
style  coloring.  Fusion-style  coloring  also  provides  a  nice  framework  to  model  various 
live-range  splitting  approaches  such  as  the  Tera,  SSA,  PDG,  and  Multiflow  approaches. 

This  dissertation  evaluates  the  effectiveness  of  fusion-style  coloring  under  the  register 
pressure  caused  by  scope  expanding  transformations  such  as  unrolling  and  inlining,  whereas 
the  traditional  Chaitin-style  coloring  breaks  down.  The  experimental  results  show  that  the 
number  of  register  overhead  operations  using  fusion-style  coloring  increases  slowly.  In 
other  words,  fusion-style  coloring  provides  a  solution  to  region-based  compilation  so  that 
register  allocation  is  no  longer  an  obstacle  to  region-based  compilation. 

This  dissertation  presents  and  evaluats  several  enhancements  to  register  allocation.  These 
enhancements  combine  the  strengths  of  both  Chaitin-style  and  priority-based  coloring  into 
one.  The  integration  of  these  enhancements  provides  a  model  of  dealing  with  call  cost 
(register  saves/restores).  The  experimental  results  show  that  the  model  actually  outperforms 
some  existing  approaches  that  have  been  adopted  by  researchers. 
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Chapter  1 
Introduction 


With  the  advanced  VLSI  technologies,  performance  of  processors  is  approximately  doubled 
every  18  months,  and  the  density  of  memory  is  doubled  every  24  months.  However,  the 
gap  between  the  access  time  of  memory  and  the  processor  speed  enlarges.  Based  on 
the  principle  of  locality,  a  memory  hierarchy  is  introduced  to  keep  values  that  processors 
operate  as  close  to  processors  as  possible  [35].  The  memory  hierarchy  usually  consists  of 
registers,  caches  (first-level,  second-level,  and  third-level),  and  memory.  Registers  are  the 
storage  locations  that  are  closest  to  the  CPU  and  can  be  read  and  written  at  the  clock  speed 
of  the  CPU.  Accessing  data  items  that  are  not  in  registers  incurs  penalties  because  the  data 
must  be  brought  in  from  the  next  level  of  the  memory  hierarchy.  The  access  time  of  data 
items  between  two  levels  may  differ  in  order  of  magnitude.  Because  a  superscalar  machine 
can  issue  multiple  instructions  in  each  cycle[39],  the  penalties  for  the  superscalar  machine 
are  more  severe  than  for  a  non-superscalar  machine.  One  task  of  the  compiler  is  managing 
register  storage  locations  to  keep  values  that  processors  are  most  likely  to  access  in  registers 
so  as  to  reduce  the  traffic  of  going  up  and  down  the  memory  hierarchy. 


1.1  Compiler  structure 

A  compiler  is  a  piece  of  sophisticated  and  complicated  software  that  translates  one  source 
language  to  a  target  machine’s  object  code.  To  achieve  modularity  and  reusability,  a  typical 
compiler  is  composed  of  three  major  phases:  front  end,  global  optimization,  and  back 
end  (code  generation).  The  front  end  usually  parses  the  source  code,  performs  syntax  and 
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semantic  checks,  and  finally  generates  an  internal  representation  (IR).  Global  optimization 
performs  code  improving  transformations  on  IR.  The  back  end  translates  IR  into  target 
machine  code.  The  main  purpose  of  IR  is  to  cut  loose  the  dependence  among  the  three 
phases.  For  instance,  adding  a  new  language  to  the  compiler  requires  only  a  new  front 
end.  Ideally,  the  global  optimization  and  back  end  do  not  require  any  change.  Likewise, 
translating  a  source  language  to  a  new  machine  code  requires  to  rewrite  only  the  back  end. 

There  are  some  phases  in  the  compiler  that  make  the  assumption  that  the  target  machine 
has  an  infinite  number  of  registers.  For  instance,  global  optimization  usually  performs  code 
improving  transformations  based  on  IR  without  any  knowledge  about  the  source  language 
and  the  target  machine.  Global  optimization,  therefore,  ignores  the  possibilities  that  the 
temporaries  generated  during  code  transformations  may  not  reside  in  registers.  Furthermore, 
parts  of  the  back  end  may  also  make  the  same  assumption.  One  implementation  of  the  back 
end  consists  of  code  selection,  code  scheduling,  register  allocation,  and  code  emission.  The 
code  selection  maps  the  IR  into  an  internal  instruction  representation  that  is  very  similar 
to  the  target  machine  code.  The  code  selection  phase  uses  virtual  registers  to  hold  values 
of  expressions,  variables,  and  constants.  The  code  scheduling  phase  reorders  instructions 
to  exploit  ILR  Register  allocation  then  maps  virtual  registers  to  physical  registers.  The 
final  phase,  code  emission,  emits  machine  code.  The  assumption  of  an  infinite  number 
of  registers  simplifies  the  tasks  of  those  phases  because  they  don’t  need  to  apply  machine 
dependent  heuristics  to  constrain  their  tasks. 


1.2  Register  allocation 

The  task  of  register  allocation  is  to  map  virtual  registers  to  a  limited  number  of  physical 
registers.  A  common  approach  is  to  model  the  register  allocation  problem  as  a  graph 
coloring  problem.  The  graph  comprises  nodes  and  edges.  Nodes  represent  live  ranges  of 
virtual  registers;  two  conflicting  (simultaneously  live)  live  ranges  are  connected  by  an  edge. 
The  graph  is  called  the  interference  graph.  The  interference  graph  is  colored  in  such  a 
manner  that  two  conflicting  live  ranges  are  assigned  different  colors  (physical  registers). 

Liveness  analysis  and  reaching  analysis  determine  the  live  range  for  each  virtual  register. 
Live  ranges  that  are  constructed  in  this  manner,  however,  may  comprise  disjoint  segments. 
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resulting  in  an  unnecessarily  high  number  of  conflicts  for  a  live  range.  Renumbering  [13] 
and  Aveb  analysis  [38]  are  two  techniques  to  construct  concise  live  ranges,  which  result  in 
interference  graphs  of  potentially  lower  degree. 

Because  finding  the  minimal  number  of  colors  for  an  arbitrary  graph  is  NP-complete 
[27],  most  register  allocation  approaches  assign  colors  to  live  ranges  in  some  heuristic  order. 
The  coloring  blocks  when  an  approach  fails  to  find  a  legal  color  for  a  live  range.  We  say  that 
the  degree  (or  register  pressure)  of  the  interference  graph  exceeds  N  (the  number  of  colors). 
When  coloring  blocks,  the  compiler  must  somehow  lower  the  degree  of  the  interference 
graph  to  allow  coloring  to  proceed.  Two  major  techniques  have  been  developed  to  lower 
the  degree  of  the  interference  graph. 

Spilling:  A  live  range  Ir  is  assigned  a  location  in  memory,  and  all  references  to  /r  are  done 
by  memory  accesses  (loads/stores),  which  are  referred  to  as  spill  code.  A  spilled 
live  range  is  removed  from  the  interference  graph  since  it  is  no  longer  a  register 
assignment  candidate,  thus  lowering  the  register  pressure. 

Splitting:  Live-range  splitting  segments  a  long  live  range  lr{x)  into  smaller  live  ranges 
lri{x).  Shuffle  code  is  then  needed  to  move  the  data  value  x  when  control  passes 
from  a  segment  lr\  to  another  segment  Ir^.  Splitting  may  reduce  the  degree  of  the 
interference  graph  because  the  smaller  a  live  range,  the  more  likely  it  is  that  the  live 
range  interferes  with  fewer  live  ranges.  The  expectation  is  that  each  lri{x)  has  a 
lower  degree  than  lr{x)  and  that  the  new  graph  can  then  be  colored. 

Simplification  [13]  is  a  common  technique  that  people  use  to  determine  the  coloring 
order  in  which  live  ranges  are  assigned  colors.  Simplification  is  based  on  the  observation 
that  if  a  node  Ir  has  degree  <  A,  then  Ir  can  be  trivially  colored  no  matter  what  colors  are 
assigned  to  the  residual  graph.  Namely,  there  is  always  a  legal  color  for  Ir  even  if  each  of 
Ir's  neighbors  is  assigned  a  different  color  because  at  most  N  ~  1  colors  are  assigned  to  Ir's 
neighbors.  Such  a  node  with  degree  less  than  N  is  called  unconstrained.  Simplification 
proceeds  by  successively  removing  unconstrained  nodes  from  the  graph.  Each  time  a  node 
Ir  is  removed  from  the  graph,  the  edges  that  are  incident  upon  Ir  are  also  removed,  and 
the  degrees  of  /r's  neighbors  are  decremented.  Once  all  nodes  have  been  removed  from 
the  graph,  colors  are  assigned  to  nodes  in  the  reverse  order  in  which  they  were  removed. 
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Simplification  blocks  when  all  remaining  nodes  have  degrees  >  N,  and  at  this  time  a  live 
range  is  picked  to  be  spilled  based  on  a  heuristic  cost  function.  Once  the  spilled  live  range 
is  removed  from  the  graph,  simplification  then  may  proceed  where  it  blocked  because  the 
degrees  of  the  neighbors  of  the  spilled  live  range  are  lowered. 

Figure  1.1(a)  illustrates  a  sequence  of  simplifying  an  interference  graph  with  N  —  3. 
The  graph  has  6  live  ranges.  Shaded  live  ranges  are  unconstrained.  Initially,  only  Ir^ 
is  unconstrained  (degree  =  2),  The  first  step  removes  Iri  from  the  graph.  Ir2  becomes 
unconstrained  after  removing  Ir^  and  then  is  removed  from  the  graph.  At  this  point  of 
time,  there  are  two  unconstrained  live  ranges,  Ir^  and  Irs,  Removing  either  one  of  them  is 
legitimate  (let’s  say  Ir^),  After  removing  all  remaining  nodes  become  unconstrained 
and  can  be  removed  in  any  order.  Assume  that  /r4,  Irs,  and  Irs  is  the  sequence  of  simplifying 
the  remaining  graph.  Figure  1.1(b)  shows  the  sequence  of  assigning  colors  to  live  ranges 
in  the  reverse  order  in  which  they  were  removed.  Each  hatch  pattern  indicates  a  color. 

1.2.1  Overview 

There  are  9  chapters  in  this  thesis.  The  thesis  starts  off  with  a  review  of  prior  approaches 
and  addresses  the  challenges  of  register  allocation  (Chapter  2).  We  then  look  at  the 
infrastructure  used  to  implement  and  evaluate  the  thesis  ideas  (Chapter  3).  This  thesis 
introduces  a  new  approach  to  register  allocation  that  is  based  on  graph  fusion.  Details  and 
some  important  techniques  of  graph  fusion  are  presented  (Chapter  4).  Nowadays,  compilers 
performs  aggressive  code  transformations  to  improve  the  code  quality.  It  is  very  important  to 
evaluate  how  register  allocation  (the  fusion-based  approach)  and  other  code  transformations 
interact  (Chapter  5).  The  thesis  also  describes  and  evaluates  some  enhancements  of  register 
allocation  (Chapter  6).  Chaitin-style  register  allocation  with  the  enhancements  serves  as 
the  base  line  of  our  comparison  in  the  thesis.  We  evaluate  the  improved  Chaitin-style 
register  allocation  with  some  prior  approaches  (Chapter  7).  Ultimately,  the  thesis  discusses 
how  the  fusion-based  approach  can  be  incorporated  with  other  existing  register-allocation 
techniques  (Chapter  8)  and  then  presents  the  conclusions  (Chapter  9). 
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Chapter  2 
Background 


In  this  chapter,  I  present  the  cost  model  for  register  allocation  (Section  2.1),  review  prior 
approaches  to  register  allocation  (Section  2.2),  address  shortcomings  of  the  prior  approaches 
(Section  2.3),  and  describe  the  challenges  that  a  register  allocator  must  deal  with  so  as  to 
obtain  good  quality  register  allocation. 

2.1  Cost  model  for  register  allocation 

Our  basic  machine  model  is  a  RISC  processor  that  requires  the  operands  of  all  operations 
to  reside  in  registers.  All  our  data  are  based  on  the  code  generator  for  the  MIPS  R3000  ar¬ 
chitecture.  The  register  allocation  cost  includes  all  overhead  operations  that  move  operands 
in  and  out  of  a  register;  this  cost  includes,  e.g.,  also  overhead  operations  that  move  values 
from  one  register  to  another.  That  is,  we  compare  the  results  of  a  register  allocator  against 
a  perfect  allocation  with  unbounded  registers.  The  higher  the  cost  (for  a  fixed  number  of 
registers),  the  worse  the  register  allocator  has  performed. 

The  register  allocation  cost  is  the  sum  of  three  components:  (i)  spill  cost  (to  move  values 
to  and  from  memory),  (ii)  call  cost  (to  free  up/restore  registers  upon  procedure  entry/exit), 
and  (iii)  shuffle  cost  (moving  values  from  one  live  range  to  another). 

Register  allocation  cost  must  include  the  call  cost.  If  the  register  allocator  focuses  on 
the  spill  cost  alone,  the  evaluation  of  the  register  allocator  may  be  overly  optimistic.  The 
call  cost,  however,  is  influenced  by  the  compiler’s  calling  convention.  Many  compilers 
divide  the  registers  into  two  sets,  callee-save  and  caller-save  registers,  respectively.  The 
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distinction  between  these  registers  provides  the  register  allocator  with  more  choices  when 
minimizing  the  call  overhead  [16].  There  is  a  distinct  cost  associated  with  each  kind  of 
register  assigned  to  a  live  range.  Assume  that  there  is  a  program  that  includes  3  functions, 
car,  bar,  and  f  oo.  Function  car  calls  bar,  and  function  bar  calls  f  oo.  When  a  live 
range  Ir  in  bar  ends  up  in  a  caller-save  register  and  contains  the  function  call  f  oo  { ) , 
we  must  pay  the  cost  of  saving  and  restoring  /r’s  value  around  f  oo  { ) .  For  instance,  x 
in  Figure  2.1  (a)  resides  in  caller-save  register  $sr.  The  function  call  f  oo  ( )  divides 
because  the  register  allocator,  without  inter-procedural  analysis,  assumes  that  f  oo  uses  all 
caller-save  registers.  If  Ir  ends  up  in  a  callee-save  register,  then  this  register  must  be  saved 
(restored)  at  the  entry  to  (exit  from)  bar  (Figure  2.1  (b)),  because  the  register  allocator 
assumes  that  car  demands  all  callee-save  registers  after  the  exit  of  bar.  In  Chapter  6, 1 
present  some  heuristics  that  improve  the  quality  of  register  allocation  by  reducing  call  cost. 
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(a)  X  resides  in  caller-save  register  $sr  (b)  x  resides  in  callee-save  register  $pr 


Figure  2.1:  Caller-save  and  callee-save  costs. 


A  live-range  splitting  approach  may  partition  a  live  range  into  smaller  segments  (smaller 
live  ranges).  Shuffle  code  is  required  at  the  splitting  point  to  move  value  between  two  live 
ranges.  Shuffle  code  can  be  of  three  types: 


•  shuffle-move  (register-to-register):  A  move  instruction  shuffles  the  value  between  the 
two  live  ranges  that  are  assigned  to  different  registers. 
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•  shuffle-load  (memory- to-register):  A  load  instruction  moves  the  value  from  one 
spilled  live  range  to  the  other  live  range. 

•  shuffle-store  (register- to-memory):  A  store  instruction  moves  the  value  from  one 
live  range  to  the  other  spilled  live  range. 

There  is  no  need  for  memory-to-memory  shuffle  code  because  the  two  live  ranges  of 
the  same  virtual  register  are  always  spilled  to  the  same  memory  location. 

2.2  Prior  work 

A  number  of  improvements  over  Chaitin- style  register  allocation  have  been  suggested. 
There  are  two  perspectives  to  discuss  prior  approaches.  We  can  focus  on  which  kind  of 
register  overhead  they  attempt  to  reduce  or  address  their  shortcomings.  We  discuss  the  two 
perspectives  separately.  In  this  section,  I  investigate  the  major  known  register-allocation 
strategies,  with  a  focus  on  which  kind  of  register  overhead  they  attempt  to  reduce.  In 
Section  2.3,  I  address  shortcomings  of  some  approaches  that  are  relevant  to  our  register 
allocator. 

2.2.1  Optimal 

Goodwin  and  Wilken  take  the  most  aggressive  approach  [29]  that  formulates  the  regis¬ 
ter  allocation  problem  as  a  0-1  integer  programming  problem.  The  approach  takes  the 
spill,  shuffle,  and  call  costs  into  consideration  and  yields  an  optimal  result  under  some 
assumptions  (e.g.,  the  register  allocator  does  not  reorder/reschedule  instructions  or  perform 
inter-procedural  register  allocation). 

2.2.2  Reducing  spill  cost 

In  essence,  there  exist  two  ways  to  reduce  spill  cost  (i)  finding  better  nodes  to  spill  (i.e., 
better  spilling  heuristics)  and  (ii)  splitting  live  ranges  (in  the  expectation  that  the  smaller 
live  ranges  interfere  with  fewer  other  live  ranges). 
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Spilling  heuristic 

One  of  the  main  goals  of  register  allocation  is  finding  a  way  to  assign  registers  to  all  live 
ranges  of  the  interference  graph.  The  register  allocator  must  spill  live  range(s)  to  memory  if 
the  register  pressure  exceeds  N.  The  register  pressure  is  the  minimum  number  of  registers 
that  the  register  allocator  needs  for  assigning  registers  to  all  live  ranges.  Because  the 
register  allocation  problem  is  NP  hard  [27],  the  minimum  number  of  registers  that  the 
register  allocator  needs  depends  on  the  register  allocation  approach — this  number  may  not 
be  the  same  as  the  theoretical  minimum  number  of  colors  for  the  interference  graph.  The 
heuristic  that  chooses  which  live  ranges  to  spill  has  a  direct  effect  on  the  quality  of  register 
allocation,  so  various  heuristics  have  been  investigated. 

•  Optimistic  coloring  [9]  is  based  on  simplification.  Recall  that  simplification  is  a 
way  to  guarantee  that  there  must  be  registers  for  the  live  ranges  that  are  removed 
from  the  interference  graph.  Simplification  is  a  heuristic  approach  to  coloring  which 
removes  only  live  ranges  whose  degrees  are  less  than  N,  and  as  such  may  miss  legal 
coloring  opportunities  (too  pessimistic).  Optimistic  coloring  improves  simplification 
by  attempting  to  assign  colors  to  live  ranges  that  would  have  been  spilled  by  the  basic 
algorithm.  Optimistic  coloring  delays  spilling  decisions  until  the  color  assignment 
phase.  Spilling  decisions  are  made  during  the  color  assignment  phase:  when  no  legal 
color  exists  for  alive  range  to  be  colored  (instead  of  when  simplification  blocks),  this 
live  range  is  spilled. 

•  Priority-based  coloring  [16]  assigns  registers  to  live  ranges  based  on  a  priority  func¬ 
tion.  The  priority  function  captures  the  savings  in  memory  accesses  from  assigning 
a  register  to  a  live  range  rather  than  keeping  the  live  range  in  memory.  The  approach 
makes  sure  to  assign  registers  to  the  most  important  live  ranges  and  to  spill  the  least 
important  ones  if  necessary. 

•  Probabilistic  register  allocation  [54],  like  priority-based  coloring,  also  assigns  regis¬ 
ters  to  live  ranges  in  a  priority  manner.  Variables  initially  reside  in  memory  instead 
of  virtual  registers  (with  load/store  for  every  use/definition).  The  priority  function  is 
the  probability  that  a  variable  x  will  reside  in  a  register  at  a  given  use  (load)  multi¬ 
plied  by  the  benefit  (savings)  that  this  use  accesses  x'$  value  from  a  register  rather 
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than  memory.  The  inverse  of  the  probability  roughly  indicates  the  distance  between 
definitions  of  the  value  and  its  use.  The  longer  the  distance,  the  more  likely  it  is  that 
the  register  allocator  assigns  registers  to  other  live  ranges  between  the  definitions  and 
the  use — therefore  the  lower  the  probability  that  the  value  resides  in  a  register. 

•  The  register  allocation  algorithm  used  in  the  RS/6000  compiler  [5]  improves  on 
Chaitin’s  basic  algorithm  in  two  ways.  First,  the  interference  graph  is  colored  three 
times,  each  time  using  a  variation  of  the  spilling  heuristic  (i.e.,  a  different  cost 
function),  and  the  coloring  resulting  in  the  least  total  spill  cost  is  selected.  Second, 
when  a  live  range  Ir  is  selected  for  spilling,  instead  of  inserting  a  load  before  each 
use  and  a  store  after  each  definition  of  Ir,  the  register  allocator  uses  the  cleaning 
technique  that  attempts  to  insert  at  most  a  single  load  and  store  inside  each  basic 
block.  In  effect,  Ir  is  split  into  segments  that  span  at  most  one  basic  block. 

•  Rematerialization  [10]  spills  live  ranges  that  are  cheaper  to  recompute  than  to 
store/load  them  back/from  memory,  e.g.,  loading  constant  values  or  computing  ad¬ 
dresses. 

Live-range  splitting 

Several  approaches  have  tried  to  split  live  ranges.  The  expense  of  splitting  is  that  shuffle 
code  must  be  inserted  when  a  live  range  is  broken  into  independent  parts,  since  each  part  can 
be  either  in  a  different  register  or  even  in  memory.  The  motivation  is  to  reduce  the  degree 
of  the  interference  graph  and  to  allow  the  spilling  of  only  those  live  range  segments  that 
span  program  regions  of  high  register  pressure.  Several  more  recent  approaches  to  register 
allocation  attempt  to  make  graph  coloring  sensitive  to  program  structure  by  dividing  a 
program  into  regions  and  prioritizing  the  regions  according  to  execution  probabilities.  The 
register  allocator  then  colors  regions  in  order  of  their  priorities,  and  shuffle  code  is  inserted 
at  the  boundaries  of  these  regions. 

•  Priority-based  coloring  [16]  assigns  colors  to  live  ranges  in  a  heuristic  order  deter¬ 
mined  by  the  priority  function.  Before  colors  are  assigned,  unconstrained  (degree 
<  N)  live  ranges  are  removed  from  the  interference  graph,  since  unconstrained  live 
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ranges  can  always  be  assigned  legal  colors  after  colors  have  been  assigned  to  other 
live  ranges.  Color  assignment  blocks  when  no  legal  color  exists  for  a  live  range  Ir 
to  be  colored,  i.e.,  when  all  N  colors  have  been  taken  up  by  Ir's  neighbors.  At  this 
point,  Ir  is  split. 

Larus  and  Hilfinger  implement  the  priority-based  approach  in  the  SPUR  Lisp  com¬ 
piler  [46]. 

•  Briggs  [7]  uses  the  Static  Single  Assignment  (SSA)  representation  of  a  program  to 
determine  splitting  points.  Splitting  live  ranges  is  taking  place  prior  to  coloring.  A 
live  range  is  split  at  a  node  when  the  incoming  values  to  the  6  node  result  from 
distinct  assignments.  The  approach  also  splits  all  live  ranges  that  span  a  loop  by 
splitting  these  live  ranges  immediately  before  and  after  the  loop. 

•  The  Tera  compiler  (as  described  in  [12])  constructs  a  tile  tree  for  a  program;  this 
tree  corresponds  to  the  control-flow  hierarchy  of  the  program.  Register  allocation 
colors  the  tiles  in  two  phases.  The  first  phase  traverses  the  tile  tree  from  the  bottom 
up  and  allocates  pseudo  registers  to  the  live  ranges  in  each  tile  using  graph  coloring. 
The  second  phase  walks  through  the  tile  tree  top-down  and  binds  pseudo  registers  to 
physical  registers.  As  coloring  is  performed  hierarchically,  shuffle  code  tends  to  be 
outside  of  the  tiles  at  the  bottom  of  the  tree. 

•  The  RAP  compiler  [53]  colors  the  region  nodes  in  a  function’s  Program  Dependence 
Graph  (PDG),  proceeding  in  a  hierarchical  manner  from  the  leaves  to  the  root. 
Chaitin’s  algorithm  is  used  at  each  region  node.  This  approach  splits  live  ranges 
on  region  boundaries.  A  region  is  a  collection  of  predicates  and  statements  that  are 
executed  under  the  same  control  conditions. 

•  The  Multiflow  compiler  employs  trace  scheduling  as  a  framework  for  both  register 
allocation  and  scheduling  [25].  The  trace  scheduler  picks  a  trace  and  then  passes 
it  to  the  code  scheduler;  the  code  scheduler  then  performs  register  allocation  and 
scheduling  together.  The  code  scheduler  records  register  usage  preferences  for  the 
scheduled  trace;  this  information  is  maintained  for  each  exit  from  or  entry  into  a 
trace.  This  information  is  subsequently  used  when  translating  traces  that  connect 
to  these  exit  or  entry  points,  and  shuffle  code  between  two  traces  is  not  needed  if 
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the  value  can  be  kept  in  the  same  register.  Traces  that  are  compiled  first  have  more 
freedom  in  using  registers,  and  shuffle  code  ends  up  on  boundaries  to  traces  that  are 
compiled  later.  As  long  as  the  trace  picker  presents  the  traces  in  an  order  that  reflects 
the  execution  frequency,  this  scheme  favors  the  most  frequently  executed  parts  of  a 
program. 

•  The  IMPACT  compiler  takes  a  similar  approach  as  the  Multiflow  compiler  [34]. 
The  compiler  selects  a  region  and  performs  classical  global  optimizations,  ILP  op¬ 
timization,  code  scheduling,  and  register  allocation.  Like  the  Multiflow  approach, 
regions  that  are  compiled  first  have  more  freedom  in  using  registers;  a  similar  binding 
technique  is  used  to  eliminate  the  shuffle  code  on  region  boundaries. 

•  Probabilistic  register  allocation  [54]  is  a  hybrid  of  the  priority-based  and  program 
structure  based  approaches.  The  approach  consists  of  three  steps,  local  register 
allocation,  global  register  allocation,  and  register  assignment.  The  global  register 
allocation  step  partitions  a  program  into  regions  based  on  the  loop  hierarchy  and 
proceeds  from  the  innermost  loops  to  the  outermost  loops.  When  a  variable  is 
assigned  to  a  register,  shuffle  code  is  placed  in  the  pre-header  and  post-exit  of  a  loop 
to  load  the  value  of  the  variable  into  the  register  and  restore  it  back  to  memory  (if  the 
value  is  updated). 

•  Coagulation  code  generation  [50]  integrates  code  generation  and  register  allocation 
based  on  the  ordering  of  prioritized  control  flow  edges.  The  most  important  (fre¬ 
quently  executed)  regions  have  the  maximum  freedom  in  using  registers,  like  in  the 
Mulitflow  compiler  [25]. 

•  Kurlander  and  Fischer  [45]  perform  live-range  splitting  after  register  allocation  to 
free  up  registers  that  can  be  used  to  improve  code  scheduling.  Empty  delay  slots  in 
the  final  schedule  are  filled  with  shuffle  code  to  split  and  spill  live  ranges.  Spilling 
frees  up  registers  and  these  additional  registers  are  used  to  remove  false  dependencies 
induced  by  the  reuse  of  registers. 

•  Kolte  and  Harrold  construct  live  ranges  at  a  very  fine  granularity,  essentially  at  every 
reference  of  live  ranges  [44]. 
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2.2.3  Reducing  shuffle  code 

Most  live-range  splitting  approaches  [12, 53,  25,  50, 54]  reduce  shuffle  cost  by  prioritizing 
a  function  into  regions  and  thereby  performing  register  allocation  in  a  hierarchical  manner 
so  as  to  place  shuffle  code  in  less  important  regions.  Some  approaches  [7,  12,  53,  25] 
rely  on  some  similar  techniques  such  as  biased  coloring  [7]  to  eliminate  a  shuffle  move  by 
assigning  the  same  color  to  the  two  live  ranges  that  the  move  connects. 

2.2.4  Reducing  call  cost 

A  register  may  not  be  free  at  procedural  boundaries,  at  call  sites  (caller-save)  or  at  the 
entries/exits  of  procedures  (callee-save).  The  cost  of  saving/restoring  registers  dominates 
the  register  overhead  as  the  number  of  available  registers  increases.  The  register  allocator 
can  reduce  the  call  cost  intra-procedurally  or  inter-procedurally. 


Intra-procedural  allocation 

The  register  allocator  analyzes  only  the  current  function  without  any  knowledge  about 
calling  or  called  functions.  Selecting  caller-save  or  callee-save  registers  for  live  ranges  and 
optimizing  placement  of  caller-save  and  callee-save  code  are  two  common  approaches  to 
reduce  the  call  cost. 

•  The  priority  function  used  in  the  priority-based  approach  [16]  considers  both  caller- 
save  and  callee-save  cost.  Register  allocation  assigns  the  kind  of  register  to  a  live 
range  which  yields  the  maximum  savings.  If  no  register  of  the  best  type  is  available, 
the  register  allocator  assigns  the  other  kind  of  register  to  the  live  range.  The  approach 
spills  live  ranges  whose  savings  for  the  assigned  registers  are  negative. 

•  The  generic  callee-save  convention  saves/restores  the  values  of  callee-save  registers 
at  the  exit/entry  of  the  function.  The  code  to  save  and  restore  is  executed  inevitably 
every  time  the  function  is  invoked,  so  there  may  be  redundant  saves  and  restores  if 
some  callee-save  registers  are  never  used  during  the  execution  of  the  function.  Chow 
uses  shrink-wrapping  [17]  to  move  the  callee-save  savings  and  restorings  close  to  the 
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places  where  the  callee-save  registers  are  used.  Hence  save  and  restore  operations 
are  performed  only  when  it  is  absolutely  necessary. 

•  The  “Chez  Scheme”  compiler  optimizes  the  placement  of  caller-save  code  [55], 
Measurements  show  that  over  two  thirds  of  procedural  activations  actually  make  no 
calls  [55],  The  compiler,  hence,  favors  using  caller-save  registers  along  the  paths  that 
contain  no  calls;  it  uses  callee-save  registers  along  the  paths  where  function  calls  are 
inevitable.  In  other  words,  live  ranges  are  split  at  the  edge  <P,S>  where  one  (P/S)  can 
reach  the  exit  without  making  a  call  and  all  paths  from  the  other  one  (S/P)  to  the  exit 
contain  calls.  Shuffle  code  (a  register  move)  is  needed  to  move  a  value  between  two 
different  classes  of  register  at  these  splitting  points. 

Inter-procedural  allocation 

Inter-procedural  register  allocation  usually  assigns  registers  in  a  hierarchical  manner  over 
the  call  graph  such  that  register  allocation  for  a  function  /  can  take  the  register-usage 
information  of  functions  h  that  are  called  by  /  into  account, 

•  Wall  defers  register  allocation  until  link  time  [61].  Variables  that  are  not  simultane¬ 
ously  live  (inter-procedurally)  are  grouped  together  to  use  the  same  physical  register. 
Variables  of  a  procedure  are  never  assigned  to  the  same  group  as  the  variables  of 
descendants  on  the  call  graph,  so  that  the  variables  of  /  get  different  registers  from 
g  and  h  (/’s  descendants).  Save/restore  operations  of  registers  across  procedures, 
therefore,  are  unnecessary.  The  code  generated  by  the  code  generation  phase  is  an¬ 
notated  so  that  the  linker  knows  what  instructions  must  be  deleted  and  can  perform 
relocation  once  the  register  allocator  assigns  a  register  to  a  variable. 

•  Chow  extends  priority-based  coloring  to  inter-procedural  register  allocation  that  con¬ 
structs  a  call  graph  and  compiles  functions  from  leaves  to  the  root  [17].  The  usage 
information  of  callee-save  registers  is  propagated  to  the  upper  regions  of  the  call 
graph.  Function  /  tries  to  avoid  using  the  same  callee-save  registers  used  by  g  and  h 
such  that  save/restore  operations  of  the  registers  become  redundant  and  can  therefore 
be  eliminated. 
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•  Steenkiste  develops  an  inter-procedural  register  allocator  in  the  context  of  a  LISP 
compiler[58].  Because  LISP  programs  tend  to  spend  most  of  time  in  the  bottom  of 
the  call  graph,  register  allocation  is  performed  from  the  leaves  to  the  root  over  the  call 
graph,  like  Chow’s  approach  [17].  Different  registers  (not  used  by  the  descendants 
of  the  current  function  in  the  call  graph)  are  assigned  to  live  ranges  of  the  current 
function. 

•  The  HP  compiler  [56]  performs  inter-procedural  register  allocation  using  global- 
variable  promotion  and  spill-code  motion.  Global-variable  promotion  transforms 
memory  accesses  to  global  variables  into  register  references  inside  clusters  of  func¬ 
tions  in  the  call  graph.  Spill-code  motion  elevates  save/restore  code  for  callee-save 
registers  to  infrequently  executed  functions. 

2.3  Shortcomings  of  prior  work 

In  the  previous  section,  we  study  how  prior  approaches  tackle  each  kind  of  register  over¬ 
head.  In  this  section,  we  look  at  prior  approaches  from  the  other  perspective,  i.e.,  how  well 
they  perform  in  terms  of  reducing  overall  register  overhead.  Because  inter-procedural  allo¬ 
cation  is  not  implemented  in  our  register  allocation  framework,  I  drop  the  inter-procedural 
approaches  out  of  the  discussion  here.  I  classify  some  prior  approaches  that  attempt  to 
reduce  the  spill  and  shuffle  costs  into  5  categories:  Chaitin-style  coloring,  optimal  register 
allocation,  splitting  prior  to  coloring,  splitting  during  coloring,  and  pro  gram  structure  based. 

2*3.1  Chaitin-style  coloring 

The  RS/6000  compiler  [5]  and  optimistic  coloring  [9]  improve  Chaitin’s  basic  algorithm. 
The  former  tunes  the  spilling  heuristic,  and  the  latter  delays  spilling  decisions  so  as  to 
assign  registers  to  more  live  ranges  than  without  delaying.  The  main  problem  with  these 
two  approaches  is  that  the  spilling  decisions  are  all-or-nothing:  all  definitions  and  uses  of  a 
spilled  live  range  go  through  memory  even  though  some  parts  of  the  live  range  could  have 
been  allocated  a  register.  It  is  more  beneficial  to  spill  only  the  troublesome  segments  of  a 
live  range  (i.e.,  segments  that  contain  few  or  no  references  and  span  regions  of  high  register 
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pressure),  while  keeping  in  registers  those  segments  that  have  references  in  regions  of  high 
execution  frequency. 

2.3.2  Optimal  register  allocation 

Optimal  register  allocation  [29]  is  an  attractive  approach  which  produces  an  optimal  result. 
The  major  problem  of  the  approach  is  the  complexity  of  the  algorithm  compilation  time 
and  the  resulting  memory  usage.  Modern  compilers  perform  aggressive  scope  enhancing 
transformations  such  as  loop  unrolling  and  function  inlining  so  as  to  increase  ILP  and 
exploit  more  optimization  opportunities.  Optimal  register  allocation  does  not  scale  well 
in  the  presence  of  scope  enhancing  transformations.  However,  as  the  performance  of 
processors  keeps  being  doubled  every  18  months,  and  faster  integer  programming  solvers 
appear,  the  approach  may  become  practical  in  the  future. 

2.3.3  Splitting  prior  to  coloring 

Briggs  [7]  determines  splitting  points  using  the  SSA  representation  of  a  program.  The 
approach  also  splits  live  ranges  that  span  a  loop  at  the  loop  boundary.  Kolte  and  Harrold 
[44]  partition  a  live  range  at  a  finer  granularity  by  considering  the  ranges  of  instructions 
between  loads  and  stores  of  a  virtual  register.  These  approaches  make  splitting  decisions 
prior  to  coloring. 

There  are  a  couple  of  drawbacks  in  regard  to  making  splitting  decisions  too  early  (prior 
to  coloring).  First,  decisions  regarding  which  live  ranges  to  be  split  and  where  to  split 
them  are  made  prematurely.  Thus,  live  ranges  may  be  split  unnecessarily,  resulting  in  a 
performance  degradation  due  to  unnecessary  shuffle  code.  Various  heuristics  have  been 
developed  to  eliminate  shuffle  code  by  increasing  the  chance  that  the  same  color  is  given 
to  partner  live  ranges,  e.g.,  biased-coloring  or  conservative  coalescing  [7].  Rather  than 
determining  the  critical  regions  where  splitting  and  spilling  are  beneficial,  these  approaches 
split  live  ranges  arbitrarily  and  greedily,  with  the  hope  that  later  heuristic  steps  will  clean 
up  unnecessary  splits.  Second,  the  places  where  live  ranges  are  split  by  these  approaches 
are  not  necessarily  the  places  of  low  execution  probability.  Although  these  approaches  may 
use  execution  probabilities  to  splitting  points,  there  is  no  guarantee  that  the  resulting  live 
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ranges  will  fit  into  registers  without  spilling.  That  is,  the  register  allocator  runs  the  risk 
of  either  splitting  too  much  (leading  to  unnecessary  shuffle  code),  or  not  enough  (with  the 
consequence  that  high-frequency  live  ranges  are  spilled). 

Even  though  Kolte  and  Harrold  attempt  to  construct  live  ranges  at  the  granularity  of 
instructions  between  loads  and  stores,  live  ranges  may  be  long  under  some  circumstances. 
Consider  a  code  for  x  that  contains  only  one  definition  and  use.  Only  one  live  range  is 
constructed  for  x.  The  definition  and  use  may  be  far  apart,  and  the  live  range  contains  a  lot 
of  blocks  with  high  register  pressure.  Splitting  the  live  range  is  not  possible  even  though 
splitting  the  parts  of  the  live  range  that  have  high  register  pressure  is  desirable.  Furthermore, 
it  is  unclear  how  they  deal  with  the  issue  of  eliminating  shuffle  code. 


2,3*4  Splitting  during  coloring 

The  priority-based  coloring  [16]  approach  is  an  alternative  framework  that  allows  splitting 
decisions  to  be  delayed  until  coloring  blocks.  Colors  are  assigned  to  live  ranges  in  a  priority 
fashion.  When  there  is  no  legal  color  for  a  live  range  Ir  (coloring  blocks),  the  approach 
splits  /r.  To  facilitate  splitting,  the  live  range  for  a  candidate  x  is  defined  as  a  collection 
of  live  units,  where  each  live  unit  is  a  basic  block  within  which  x  is  live.  Splitting  forms  a 
new  live  range  lr*{x)  by  starting  from  a  seed  live  unit  and  incrementally  adding  live  units 
to  the  new  live  range  until  adding  one  more  live  unit  renders  lr\x)  uncolorable.  Live  units 
are  added  in  a  breadth-first  traversal  of  the  control  flow  graph,  preferably  starting  from  a 
live  unit  where  the  first  reference  to  x  is  a  definition.  A  live  range  is  spilled  (i.e.,  remains 
in  its  home  location)  when  no  live  units  that  comprise  the  live  range  can  be  given  a  register. 

An  important  consideration  in  live-range  splitting  is  selecting  splitting  points.  To 
reduce  shuffle  cost,  shuffle  code  should  be  placed  at  points  of  low  execution  probability. 
The  priority  based  approach  neither  takes  execution  frequency  nor  program  structure  into 
account  when  splitting  live  ranges,  and  there  is  no  guarantee  that  splitting  points  do  not  end 
up  along  frequently  executed  edges.  Shuffle  code  induced  by  a  split  may  end  up,  e.g.,  on 
a  loop  back  arc.  Code  motion  techniques  are  used  after  color  assignment  to  optimize  the 
placement  of  the  shuffle  code. 
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2.3.5  Program  structure  based 

Program  structure  based  approaches  [12, 53, 25, 34, 50, 54]  attempt  to  make  graph  coloring 
sensitive  to  program  structure  by  dividing  a  program  into  regions  and  prioritizing  the  regions 
according  to  execution  probabilities.  The  register  allocator  then  colors  regions  in  the  order 
of  their  priorities,  and  shuffle  code  is  inserted  at  the  boundaries  of  these  regions.  As  coloring 
is  performed  hierarchically,  the  higher  priority  regions  have  more  freedom  in  using  registers, 
and  shuffle  code  tends  to  be  outside  the  higher  priority  regions.  As  long  as  the  regions 
reflect  the  execution  frequency,  they  favor  the  most  frequently  executed  parts  of  a  program. 

Those  approaches  all  make  early  binding  decisions  when  coloring  is  performed  to  a 
region — either  assign  real  physical  registers  to  live  ranges  or  determine  which  live  ranges 
should  share  the  same  registers.  Early  binding  decisions  impose  unnecessary  constraints 
to  lower  priority  regions  because  the  lower  priority  regions  must  bind  to  the  already-made 
coloring  decisions. 


Example  of  early  binding 

Let  us  consider  the  Tera  approach  [12]  as  an  example.  Figure  2.2  illustrates  unnecessary 
constraints  imposed  on  register  allocation  by  premature  coloring  decisions.  The  approach 
constructs  a  tile  tree  for  a  program.  The  code  in  Figure  2.2(a)  consists  of  two  tiles,  one  for 
the  loop  (the  shaded  region)  and  one  for  blocks  Z?i  and  Be.  Recall  that  the  Tera  compiler 
performs  register  allocation  for  the  tiles  in  two  phases.  The  first  phase  colors  each  tile  in  a 
bottom  up  fashion  (from  the  bottom  of  the  tile  tree  up  to  the  root  tile).  Pseudo  registers  are 
assigned  to  live  ranges  in  the  first  phase.  Once  two  variables  in  a  tile  are  assigned  to  the 
same  pseudo  register,  the  parents  of  the  tile  must  adhere  to  this  decision. 

The  loop  tile  is  colored  first.  The  interference  graph  of  the  loop  tile  is  depicted  in  Figure 
2.2(b).  Inside  the  loop  tile,  the  live  range  lr{y)  is  live  in  B2  and  3$,  There  are  two  live 
ranges  for  x,  lr{x)  and  lr'{x),  in  the  loop.  The  live  range  lr{x)  is  live  in  Bi  and  3$,  and 
lr\x)  is  live  in  B4,  Coloring  the  graph  with  two  registers,  lr'{x)  is  placed  into  the  same 
pseudo  register  as  either  /r(^’)  or  lr{y).  Based  on  the  information  inside  the  loop  region, 
both  choices  are  reasonable.  However,  if  lr'{x)  and  lr{y)  use  the  same  pseudo  register,  a 
shuffle  move  is  inevitable  on  either  (i?4,  Be)  or  (ft,  ft),  since  lr{x)  and  lr*{x)  merge  in 
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Z?6,  The  notation  i?„)  represents  a  control-flow  edge  that  goes  from  block  to  Z?h. 
If  lr*(x)  and  lr{x)  use  the  same  pseudo  register,  no  shuffle  code  is  required,  but  that  this 
mapping  is  beneficial  can  only  be  determined  when  dealing  with  Be. 


2.4  Challenges 

There  are  several  challenges  to  register  allocation.  The  compiler’s  decisions  to  those 
challenges  influence  the  quality  of  register  allocation. 

•  coloring  order.  In  which  order  are  live  ranges  assigned  colors. 

•  spilling  versus  splitting:  When  the  register  allocator  perceives  that  register  pressure 
is  high  and  ought  to  take  some  actions  to  lower  the  register  pressure.  Both  spilling 
and  splitting  live  ranges  reduce  the  pressure.  The  register  allocator  has  to  choose 
which  technique  to  use  to  lower  the  register  pressure.  Which  live  ranges  are  spilled 
or  split? 

•  splitting  points:  In  the  case  of  splitting  live  ranges,  how  does  the  register  allocator 
determine  splitting  points  for  the  live  ranges? 
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•  reconciliation:  How  to  reconcile  register  allocation  of  two  regions  is  a  fundamental 
challenge  to  the  program  structurebased  approaches.  If  the  register  allocator  performs 
register  allocation  for  a  region  and  doesn’t  reconcile  the  region  with  other  regions, 
excessive  shuffle  code  may  arise  at  region  boundaries. 

•  call  cost:  What  is  the  heuristic  of  assigning  a  callee-save  or  caller-save  register  to  a 
live  range? 

2.5  Preview  of  fusion-based  register  allocation 

In  this  chapter,  I  briefly  review  prior  approaches  on  register  allocation,  and  address  their 
shortcomings.  None  of  the  heuristic  (non-optimal)  approaches  taken  so  far  considers  the 
program  structure  during  both  splitting  and  coloring.  However,  splitting  and  coloring  are 
closely  related  to  each  other.  The  decision  to  allow  some  live  ranges  to  reside  in  the  same 
register  (at  different  times)  affects  subsequent  splitting  and  coloring  decisions.  The  ideal 
model  is  to  postpone  the  splitting  and  register  binding  decisions  as  late  as  possible  so  that 
more  information  can  be  considered. 

Fusion-based  register  allocation  is  a  bottom-up  approach  that  builds  up  the  interference 
graph,  starting  off  with  live  ranges  that  extend  at  most  a  single  region.  Each  region  initially 
has  its  own  interference  graph.  A  region  can  be  as  small  as  a  single  basic  block,  or  as 
large  as  a  whole  function  (in  which  case  the  algorithm  is  identical  to  Chaitin-style  register 
allocation).  Regions  are  connected  via  control-flow  edges,  which  are  prioritized.  These 
edges  are  then  considered  in  priority  order,  and  the  interference  graphs  of  two  regions 
connected  by  an  edge  are  merged  by  fusing  the  interference  graphs.  The  fusion  operator 
coalesces  live  ranges  that  span  the  control  flow  edge  and  maintains  the  invariant  that  the 
resulting  merged  interference  graph  is  colorable,  if  necessary  by  splitting  (suppressing  the 
coalescing)  of  a  live  range.  Thus  the  fusion  operator  makes  spilling  and  splitting  decisions 
when  it  becomes  clear  that  it  is  impossible  or  unprofitable  to  keep  all  live  ranges  in  registers. 
After  all  control  flow  edges  have  been  considered,  a  single  colorable  interference  graph 
remains. 
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Chapter  3 

Experimental  Framework 


In  this  chapter,  I  describe  cmcc,  the  Carnegie  Mellon  C  Compiler,  which  provides  the 
context  for  my  implementation  of  register  allocation.  I  also  describe  the  register  allocation 
framework  that  provides  the  infrastructure  for  various  register  allocation  approaches. 


3.1  The  cmcc  compiler 

The  cmcc  compiler  is  an  optimizing,  retargetable  compiler  for  a  complete  programming 
language  (ANSI  C).  cmcc  consists  of  three  major  phases,  front  end,  global  optimization, 
and  code  generation.  The  front-end  phase  uses  the  Icc  ANSI  C  front  end  [24]  (written  in 
C).  The  code  generation  of  Icc  is  modified  to  generate  cmcc  Intermediate  Representation 
(IR)  operations.  The  global  optimization  and  code  generation  phases  are  coded  in  C++.  The 
global  optimization  phase  accepts  cmcc’s  internal  IR  generated  by  the  front-end  phase  and 
performs  global  optimizations.  Table  3.1  lists  the  global  optimizations  performed  by  cmcc. 
The  code  generation  phase  performs  code  selection,  local  code  scheduling  (list  scheduling), 
and  register  allocation,  cmcc  produces  assembly  language  for  a  variety  of  target  machine 
architectures:  MIPS  [40],  SPARC  [51],  and  iWarp  [6].  The  results  (register  overhead 
operations  and  execution  time)  in  this  dissertation  are  based  on  the  MIPS  architecture. 

Table  3.2  contrasts  the  optimized  code  generated  by  cmcc  with  the  optimized  code 
produced  by  gcc  and  the  native  MIPS  cc  compiler  for  the  MIPS  architecture,  and  with 
the  code  produced  by  the  native  SUN  cc  compiler  for  the  SPARC  architecture.  (This  table 
presents  the  performance  relative  to  these  compilers;  a  number  of  less  than  one  means  that 
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Loop  unrolling  and  peeling 
Linear  function  test  replacement 
Induction  variable  expansion 
Induction  variable  simplification 
Constant  propagation  and  folding 
Induction  variable  elimination 
Assignment  propagation 
Partial  dead  code  elimination 
Dead  assignment  elimination 
Partial  redundancy  elimination 
Strength  reduction 
Branch  optimizations 
Instruction  scheduling 

Global  register  allocation  (using  graph  coloring) 
Register  coalescing _ 

Table  3.1:  Optimizations  performed  by  cmcc 


MIPS 

SUN 

program 

gcc  -02 

cc  -0 

CC  -0 

alvinn 

1.07 

0.94 

0.79 

compress 

0.84 

0.95 

1.04 

ear 

1.08 

0.95 

0.99 

eqntott 

1.14 

1.09 

0.59 

espresso 

1.06 

1.05 

1.02 

li 

0.98  : 

1.05 

1.16 

sc 

1.07 

1.03 

1.00 

Table  3.2:  Performance  of  optimized  code  generated  by  cmcc,  relative  to  optimized  code 
generated  by  gcc  (version  2.3.2)  and  MIPS  cc  on  a  DECstation  5000/200,  and  SUN  cc 
on  a  SPARC. 

cmc  c  produces  better  code.)  The  sample  programs  are  the  C  programs  of  the  SPEC92  suite. 
Overall,  even  without  serious  performance  tuning,  cmcc  produces  code  that  is  roughly  of 
the  same  quality  as  the  code  produced  by  the  other  compilers. 


3.2  The  register-allocation  framework 

In  the  cmcc  compiler,  register  allocation  is  performed  after  code  scheduling.  The  cmcc 
register  allocator  takes  a  scheduled  code  sequence  in  which  instructions  use  virtual  registers 
and  assigns  physical  registers  based  on  graph  coloring. 
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The  cmcc  compiler  implements  a  register  allocation  framework  [1]  for  experimentation 
of  various  register-allocation  approaches.  A  framework  partitions  the  task  of  the  framework 
into  abstract  classes  and  defines  their  responsibilities  and  collaboration  [26].  A  framework 
fosters  code  reuse  by  capturing  the  common  structure  to  its  application  domain.  Various 
applications  that  share  the  same  framework  are  easy  to  maintain  because  the  code  is  concise 
and  any  change  or  extension  to  the  framework  benefits  to  all  applications.  Furthermore,  a 
framework  facilitates  a/arr  comparison  which  is  very  important  forresearches.  For  example, 
if  an  improvement  is  added  to  some  parts  of  register  allocation  (e.g.,  the  construction  of 
the  interference  graph,  or  color  assignment)  then  all  register- allocation  approaches  can 
benefit  from  the  improvement.  Using  a  framework  avoids  a  bias  in  favor  of  a  one  particular 
approach. 


Figure  3.1:  Structure  of  the  register  allocator. 


I  identify  seven  phases  in  the  register  allocator:  graph  construction,  live-range  coa¬ 
lescing,  color  ordering,  color  assignment,  graph  reconstruction,  spill-code  insertion,  and 
shuffle-code  insertion  (as  illustrated  by  Figure  3.1).  This  register- allocation  structure  allows 
us  to  model  a  wide  range  of  register  allocation  approaches.  A  register- allocation  approach 
customizes  the  register- allocation  framework  by  providing  specific  implementations  of  the 
abstract  classes  (subclassing). 


Graph  construction 

The  graph-construction  phase  builds  the  interference  graph  for  the  input  instruction  se¬ 
quence.  This  phase  utilizes  the  data-flow  analysis  framework  [1]  to  compute  the  reaching 
and  live  properties  of  virtual  registers  for  building  live  ranges,  reaching  determines  if  a 
definition  of  a  virtual  register  reaches  the  entry/exit  of  a  basic  block,  live  decides  if  the 
value  of  a  virtual  register  is  used/referenced  after  the  entry/exit  of  a  basic  block.  Various  ap- 
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proaches  may  split  live  ranges  to  lower  register  pressure  while  constructing  the  interference 
graph.  Details  are  presented  in  Chapter  4. 


Coalescing 

The  coalescing  phase  eliminates  copy  instructions  that  move  data  values  between  two  non¬ 
conflicting  live  ranges  by  coalescing  the  two  live  ranges  into  one.  If  a  copy  instruction 
moves  a  value  from  a  live  range  to  a  physical  register,  or  from  a  physical  register  to  a  live 
range  (such  as  calling  convention),  the  coalescing  phase  sets  the  partner  information.  A 
live  range  may  have  several  partners.  The  partner  relationship  indicates  that  a  live  range 
would  like  to  receive  the  same  color  as  one  of  its  partners.  The  later  color-assignment  phase 
can  thereby  take  the  partner  information  into  consideration  and  assign  colors  to  live  ranges 
(biased  coloring  [7]). 

Color  ordering 

There  are  two  data  structures  that  are  used  as  the  interface  between  the  color-ordering  and 
color-assignment  phases:  (1)  the  color  stack  (C),  and  (2)  the  spill  pool  (5).  The  color¬ 
ordering  phase  heuristically  determines  the  order  in  which  live  ranges  are  to  be  assigned 
colors  and  pushes  live  ranges  onto  C  based  on  the  ordering  —  the  higher  the  position  within 
C,  the  higher  the  priority.  S  keeps  all  spilled  live  ranges.  While  deciding  the  color  ordering, 
this  phase  may  make  spilling  decisions  due  to  lack  of  colors  and  may  add  the  spilled  live 
range  to  S,  The  spill-code  insertion  phase  allocates  spaces  on  the  local  heap  for  spilled  live 
ranges  in  S,  and  inserts  actual  spill  code. 


Color  assignment 

The  color- assignment  phase  pops  live  ranges  fromC  and  finds  legal  colors  (non-conflicting 
colors)  for  them,  i.e.  this  phase  assigns  colors  based  on  the  ordering  computed  by  the 
previous  phase  and  uses  the  interference  graph  to  make  sure  that  conflicting  live  ranges  are 
assigned  different  colors.  Biased  coloring,  which  considers  the  partner  information,  is  used 
to  find  legal  colors  so  as  to  eliminate  copy  instructions.  That  is,  a  non-conflicting  partner’s 
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color  of  a  live  range  may  be  assigned  to  the  live  range  to  eliminate  the  copy  instruction 
that  move  the  value  between  the  partner  and  the  live  range.  Details  of  biased  coloring  are 
presented  in  Section  4.5.  If  this  phase  is  unsuccessful  to  find  a  legitimate  color  for  a  live 
range  because  all  colors  are  already  taken  by  its  neighbors,  the  live  range  is  then  spilled  and 
added  to  S, 

The  spilling  that  we  have  discussed  so  far  is  called  out-of-color  spilling  because  live 
ranges  are  spilled  with  the  goal  of  reducing  register  pressure.  Register  allocation  may  also 
make  spilling  decisions  to  reduce  the  overall  overhead  operations  even  though  there  are  still 
enough  colors  for  live  ranges.  This  kind  of  spilling  is  called  call-cost  spilling  (details  are 
discussed  in  Chapter  6). 


Graph  reconstruction 

For  simplicity,  the  color-ordering  and  color-assignment  phases  make  spilling  decisions 
under  the  assumption  that  a  spilled  live  range  does  not  need  any  register  resource.  However, 
for  load/store  machines,  memory  accesses  must  be  done  by  load/store  instructions,  which 
require  to  use  a  register  as  one  operand  (destination/source)  [35].  In  other  words,  at  every 
occurrence  of  spilled  live  ranges,  there  is  an  instant  requirement  of  a  register.  Hence,  spill 
code  needs  to  acquire  registers.  Reserving  registers  for  spill  code  is  one  way  of  making  sure 
that  there  are  registers  for  spill  code.  A  register  allocator  reserving  registers  for  spill  code, 
nevertheless,  works  with  fewer  colors  than  one  without  reserving  registers.  Our  register 
allocator  reserves  no  registers  for  spill  code;  that  is,  the  register  space  is  not  divided  into 
“local”  and  “global”  registers.  To  ensure  that  there  are  registers  for  spill  code,  the  register 
allocator  rebuilds  the  interference  graph  (constructs  live  ranges  for  spill  code)  after  spilling 
live  ranges  and  restarts  from  the  coalescing  phase. 

There  are  two  ways  to  reconstruct  the  interference  graph:  (1)  spill  code  (a  load/store) 
is  inserted  into  the  schedule  immediately  before/after  a  use/definition,  the  interference 
graph  is  thrown  away,  and  the  whole  coloring  process  restarts  from  the  first  phase;  (2) 
the  existing  interference  graph  is  modified  by  producing  a  new  small  live  range  for  each 
occurrence  of  a  spilled  live  range  (no  spill  code  is  inserted;  the  newly  produced  live  ranges 
span  only  one  instruction),  and  the  coloring  process  can  then  skip  the  graph  construction 
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phase  and  proceed  from  the  coalescing  phase  with  the  modified  graph.  Those  newly 
produced  live  ranges  are  called  spilUcode  live  ranges.  The  first  approach  reuses  the  code 
for  constructing  the  interference  graph.  The  second  approach  favors  compilation  speed, 
because  the  interference  graph  is  not  rebuilt  from  scratch,  cmcc  takes  the  second  approach 
(i.e.,  it  reuses  the  graph,  to  save  compilation  time). 


code  interference  code  interference 

. . ,  graph  ^ . . ,  graph 


Figure  3.2:  SpilFcode  live  ranges. 

Figure  3.2  illustrates  an  example  of  spill  code.  Enclosed  in  dotted  lines  are  sequences 
of  code.  What  we  are  interested  in  is  the  live  range  /r^;.  The  add  instruction  defines  x, 
and  the  sub  instruction  is  the  last  instruction  that  uses  a:.  The  shaded  area  in  Figure  3.2(a) 
is  the  range  in  which  x  is  live  before  Ir^  is  spilled.  Any  reference  of  x  within  the  area 
accesses  the  value  of  /r^;.  When  Iva;  is  spilled,  the  first  approach  of  reconstructing  the 
interference  graph  first  inserts  the  two  spill  code  operations  (one  store  and  one  load  as 
shown  in  Figure  3.2(b)).  The  graph  construction  phase  then  rebuilds  the  interference  graph 
for  the  new  code  sequence.  As  a  result,  the  x  defined  in  the  add  instruction  and  the  x 
accessed  by  the  sub  instruction  belong  to  two  different  live  ranges,  and  lr^'\  because 
the  Id  instruction  re-defines  x  (start  of  /r^;").  The  second  approach  of  reconstructing 
graph  constructs  the  new  interference  graph  by  modifying  the  original  interference  graph 
of  Figure  3.2(a).  The  s  t  and  Id  instructions  are  not  inserted  into  the  code  sequence  until 
the  later  spill-code  insertion  phase. 

The  next  pass  of  register  allocation  (restart  from  the  coalescing  phase)  attempts  to  assign 
registers  to  all  the  live  ranges  including  spill-code  live  ranges.  Other  live  ranges  may  be 
spilled  during  this  iteration,  the  register  allocator  repeats  register  allocation  until  no  more 
live  range  is  spilled.  Because  spill-code  live  ranges  span  only  one  instruction,  spilling  them 
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does  not  help  to  lower  register  pressure  but  increases  spill  code  instead.  Therefore,  they  are 
marked  non-spillable  so  that  the  register  allocator  won’t  spill  them  in  subsequent  iterations 
of  register  allocation. 

Figure  3.3  illustrates  an  example  such  that  the  register  allocator  needs  to  perform 
multiple  iterations  of  register  allocation.  A  white  circle  in  the  figure  indicates  a  definition, 
and  a  black  circle  indicates  a  use.  Solid  vertical  lines  indicate  live  ranges  of  virtual  registers. 
The  example  has  two  colors  and  three  live  ranges,  x,  y,  and  The  three  live  ranges  overlap 
(as  shown  in  Figure  3.3(a))  so  they  interfere  with  each  other  and  must  reside  in  distinct 
registers.  Hence,  the  register  allocator  must  spill  one  live  range  (assume  that  x  is  spilled). 
The  live  range  x  is  broken  into  two  spill-code  live  ranges  (depicted  in  Figure  3.3(b)). 
The  dotted  line  between  the  two  spill-code  live  ranges  indicates  the  value  of  x  residing  in 
memory  instead  of  a  register.  Even  though  the  live  range  x  is  spilled  already,  we  see  that 
the  maximum  number  of  register  requirement  is  still  three  at  the  use  of  x  (shaded  range). 
Thus  the  register  allocator  must  spill  another  live  range  (assume  that  5;  is  spilled)  during  the 
second  iteration  of  register  allocation  and  re-start  register  allocation  over  again.  After  5?  is 
spilled,  the  maximum  number  of  register  requirement  is  two  at  any  point  of  the  program 
(shown  in  Figure  3.3(c)). 


O  def 

%  use 


m 


(a) 


Figure  3.3:  Spill  code  with  N  =  2. 


SpOl-code  insertion 

The  spill-code  insertion  phase  allocates  spaces  for  spilled  live  ranges  on  S,  We  classify 
virtual  registers  into  global  and  local  virtual  registers.  Global  virtual  registers  are  those  who 
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appear  in  multiple  blocks.  Local  virtual  registers  are  those  that  appear  within  one  single 
basic  block.  The  spilled  live  ranges  that  correspond  to  the  same  virtual  register  are  always 
spilled  to  the  same  location.  In  other  words,  spaces  are  allocated  to  virtual  registers.  No 
space  is  allocated  to  a  virtual  register  if  no  live  range  of  the  virtual  register  is  spilled.  Global 
virtual  registers  have  their  own  locations.  Nonetheless,  spaces  for  local  virtual  registers 
are  squeezed/compacted  using  the  same  graph  coloring  technique.  That  is,  the  locations 
already  allocated  to  the  neighbors  of  a  spilled  live  range  are  not  allocated  to  the  live  range. 

After  spaces  for  spilled  live  ranges  are  allocated,  the  spill-code  insertion  phase  traverses 
each  instruction  and  inserts  spill  code  for  each  reference  of  spilled  live  ranges.  Each 
reference  of  a  spilled  live  range  has  its  own  spill-code  live  range,  which  acquires  a  register 
for  loading  from  or  storing  to  memory.  If  there  exists  a  source  operand  (use)  of  an  instruction 
and  the  corresponding  live  range  is  a  spill-code  live  range,  then  a  load  instruction  is  inserted 
before  the  instruction  to  load  the  value  of  the  operand  from  memory.  If  there  is  a  destination 
operand  (definition)  of  an  instruction  which  is  a  spill-code  live  range,  then  a  store  is  appended 
after  the  instruction  to  spill  the  value  of  the  live  range  to  memory. 


Shuffle-code  insertion 

Register  allocation  may  include  one  or  more  live-range  splitting  approaches  to  split  live 
ranges  into  smaller  segments  to  reduce  register  pressure.  In  this  register  allocation  frame¬ 
work,  splitting  can  occur  only  on  control-flow  edges  (not  within  a  basic  block).  The 
shuffle-code  insertion  phase  goes  through  control-flow  edges  and  inserts  shuffle  code  for 
those  live  ranges  that  should  span  across  the  edges  but  reside  in  different  storage  locations 
instead  (details  are  discussed  in  Section  4.6). 

3.2,1  Modeling  existing  register-allocation  approaches 

This  section  elucidates  how  various  register  allocation  approaches  fit  into  this  register 
allocation  framework.  Table  3.3  shows  how  the  register  allocation  framework  can  model 
various  register-allocation  approaches. 

Chaitin-style  coloring  [13,  5]  uses  a  simplification  process  to  determine  the  coloring 
order.  Simplification  moves  all  unconstrained  livtrmgQS  (degree  <  N)  to  the  C  stack;  when 
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Phases 

Register  allocation  \ 

Chaitin-style 

Priority-based 

Optimistic 

Fusion-style 

Gr^h  construction 

function 

function 

function 

region 

out-of-color  spill 
split 

Graph  reconstruction 

X 

X 

X 

X 

Coalescing 

X 

X 

X 

X 

Color  ordering 

sin^lification 
out-of-color  spill 

priority 

simplification 

simplification 

Color  assignment  ' 

call-cost  spill 

out-of-color  spill 
call-cost  spill 

out-of-color  spill 
call-cost  spill 

call-cost  spill 

Spill-code  insertion 

X 

X 

X 

X 

Shuffle-code  insertion 

X 

Table  3.3:  Model  different  register- allocation  approaches. 


a  node  is  removed,  all  edges  incident  upon  the  node  are  also  removed,  opening  up  further 
opportunities  for  simplification.  If  simplification  blocks  (all  nodes  have  degree  >  N),  a 
least-cost  live  range  is  spilled  (added  to  5)  so  that  the  simplification  process  can  proceed. 
The  Chaitin-style  approach  makes  out-of-color  spilling  decisions  in  the  color-ordering 
phase. 

Optimistic  coloring  [9]  also  uses  simplification  to  determine  the  coloring  order.  When 
simplification  blocks,  however,  rather  than  spilling,  optimistic  coloring  pushes  a  least-cost 
live  range  onto  C,  removing  the  live  range  from  the  residual  graph.  In  this  manner,  a  live 
range  that  would  have  been  spilled  by  the  Chaitin-style  approach  is  given  another  chance 
at  receiving  a  register  in  the  color-assignment  phase.  Optimistic  coloring  defers  out-of- 
color  spilling  decisions  until  the  color-assignment  phase,  which  spills  those  live  ranges  that 
cannot  be  assigned  legal  colors  (all  colors  are  taken  up  by  neighbors). 

The  priority-based  coloring  that  I  implement  is  Chow’s  priority-based  coloring  [16] 
without  live  range  splitting .  Priority-based  coloring  determines  the  coloring  order  according 
to  apriority  function.  This  approach  first  separates  out  unconstrained  live  ranges  and  pushes 
them  onto  C,  since  unconstrained  live  ranges  can  always  be  assigned  legal  colors.  The 
priority  function  determines  the  order  in  which  constrained  live  ranges  are  pushed  onto 
C  (from  the  smallest  to  the  biggest).  During  the  color-assignment  phase,  if  a  live  range 
cannot  be  assigned  a  legal  color  (because  of  conflicts  with  the  neighbors’  colors),  then 
priority-based  coloring  spills  this  live  range. 

This  framework  can  also  model  other  approaches  to  live-range  splitting  that  take  the 
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program  structure  into  account,  such  as  the  approaches  taken  by  Tera  [12],  Multiflow  [25], 
and  RAP  [53],  as  well  as  splitting  based  on  the  SSA  representation  [7],  and  the  integrated 
approach  to  splitting  and  spilling  provided  by  fusion-style  coloring[48].  Fusion-style 
coloring  makes  splitting  and  spilling  decisions  in  the  graph  construction  phase.  A  more 
detailed  discussion  of  how  fusion- style  coloring  models  other  approaches  is  presented  in 
Chapter  4. 

This  framework  emphasizes  what  priority-based  coloring  [16]  and  the  Chaitin-based 
approaches  have  in  common:  they  all  try  to  determine  the  color  ordering  in  which  live 
ranges  are  assigned  colors  (registers).  Priority-based  coloring  uses  cost  analysis  as  the 
priority  to  determine  the  ordering  and  guarantees  that  the  register  assignment  phase  assigns 
registers  to  the  most  important  (high  savings  of  memory  accesses)  live  ranges.  Chaitin- 
style,  optimistic,  and  fusion-style  simplify  the  interference  graph  to  decide  the  ordering. 
Optimistic  coloring  is  similar  to  priority-based  coloring  in  the  sense  that  both  approaches 
delay  out-of-color  spilling  decisions  until  the  color- assignment  phase. 

Fusion-style  register  allocation  requires  the  shuffle-code  insertion  phase  to  insert  code 
to  move  values  at  splitting  points.  Shuffle-code  insertion  is  just  a  nop  for  the  Chaitin-style, 
optimistic,  and  priority-based  approaches  because  these  approaches  do  not  split  live  ranges. 


Chapter  4 

Fusion-Based  Register  Allocation 


In  chapter,  I  explain  the  structure  of  fusion-based  register  allocation,  the  fusion  operation 
that  fuses  two  interference  graphs,  and  some  techniques  used  during  graph  fusion.  In 
Section  4.1,  I  present  an  overview  of  fusion-based  register  allocation.  In  Section  4.2,  I 
present  the  fusion  operator  that  fuses  two  graphs.  In  Section  4.3,  I  discuss  how  fusion- 
based  register  allocation  models  different  splitting  approaches.  In  Section  4,4, 1  describe 
delayed  spilling  that  avoids  making  premature  spilling  decisions  while  two  graphs  are  being 
fused.  In  Section  4.5,  I  describe  the  biased  coloring  technique  that  is  used  to  eliminate 
unnecessary  move  operations.  In  Section  4.6, 1  present  the  ordering  of  shuffle  code  and  the 
optimization  of  placing  shuffle  code.  In  Section  4.7, 1  measure  the  influence  of  fusion-based 
register  allocation  for  various  SPEC92  programs. 


4.1  Overview 

Fusion-based  register  allocation  identifies  regions,  builds  the  interference  graph  for  each 
region,  and  then  merges  all  regions’  interference  graphs  along  control-flow  edges.  In 
this  section,  I  concentrate  on  two  essential  phases  of  fusion-based  register  allocation, 
graph  construction  and  color  assignment,  that  make  spilling,  splitting,  and  register  binding 
decisions.  Graph  fusion  takes  place  during  the  graph  construction  phase  which  is  further 
divided  into  three  steps  (shown  in  Figure  4.1):  region  formation,  graph  simplification,  and 
graph  fusion. 

Here  we  define  some  notations  that  are  used  in  this  chapter. 
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Figure  4.1:  Structure  of  the  fusion-based  register  allocator. 

•  En)  indicates  a  control-flow  edge  that  goes  from  block  to 

•  Gji  represents  the  interference  graph  of  region  R. 

•  Ivi  (x)  denotes  the  segment  of  a  virtual  register  a:’s  live  range  lr{x)  that  extends  block 
Bi,  and  there  is  one  or  more  segment  for  every  region  R  where  x  is  live  and  reaching. 

•  ,  Iry]  is  the  interference  edge  that  connects  live  ranges  Ir^  and  Iry. 

•  Ivij  is  the  combined  live  range  of  /r,  and  /r^. 

•  ncighbor{lr)  is  the  set  of  all  live  ranges  that  Ir  interferes  with. 

•  '>p{lr)  is  the  set  of  all  interference  edges  of  Ir. 

•  GHiUH2  refers  to  the  resulting  graph  of  fusing  (or  merging)  Gri  and  Gh2> 

•  GR^^JR^  represents  the  resulting  graph  of  fusing  Gri  and  Gr2  along  the  control-flow 
edge  E. 

4.1.1  Region  formation 

In  this  phase,  regions  are  formed  using  any  number  of  possible  techniques.  A  region  consists 
of  multiple  blocks  (or  one  single  block)  and  the  edges  that  connect  the  blocks  within  the 
region.  For  example,  aregion  can  be  a  single  basic  block,  a  trace  [47],  a  superblock  [37],  a 
region  as  defined  in  [33],  the  blocks  at  a  particular  static  loop  nesting  level  [12],  or  the  blocks 
within  a  PDG  region  node  [53].  Control  flow  edges  that  lie  outside  of  regions  are  then 
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ordered  according  to  some  priority  functions  consistent  with  the  region  formation  approach, 
e.g.,  edges  entering  innermost  loop  regions  are  placed  before  those  entering  outermost  loop 
regions.  One  particularly  attractive  priority  function  is  the  use  of  execution  probabilities. 
These  can  be  derived  either  from  profile  information  [23, 60],  from  static  estimates  such  as 
loop  nesting  depth  [16],  or  from  static  branch  estimates  [47].  The  choice  of  edge  ordering 
is  orthogonal  to  register  allocation  but  of  course  influences  the  quality  of  the  code,  as  our 
register  allocation  framework  is  edge  order  sensitive:  shuffle  code  is  less  likely  to  end 
up  on  edges  that  are  ordered  first,  and  spilling  decisions  are  delayed  until  later  edges  are 
processed.  During  region  formation,  an  interference  graph  Gn  is  built  for  each  region  R, 

There  are  two  issues  related  to  choosing  the  size  of  the  base  region:  spill  and  shuffle 
costs.  The  smaller  the  size  of  the  base  region,  the  more  likely  the  register  allocator  generates 
shuffle  code  and  reduces  the  spill  cost  because  live  ranges  are  allowed  to  be  split  at  finer 
granularity  (at  the  cost  of  shuffle  code)  to  reduce  the  spill  cost.  On  the  other  hand,  the 
bigger  the  size  of  the  base  region,  the  more  likely  the  register  allocator  generates  spill  code 
to  reduce  the  shuffle  cost.  It  is  impossible  to  state  that  one  strategy  is  better  than  the  other 
because  the  actual  outcome  depends  on  characteristics  of  the  programs.  Chapter  5  discusses 
those  two  issues  in  detail. 

4,1,2  Graph  simplification 

The  objective  of  the  graph  simplification  phase  is  to  determine  how  many  live  ranges  must 
be  spilled  within  each  region.  If  an  interference  graph  Gh  can  be  simplified,  then  no 
spill  code  is  necessary  within  region  R,  But  if  Gr  cannot  be  simplified,  then  from  R's 
perspective  the  cheapest  live  ranges  to  spill  within  R  are  those  that  are  transparenU  i.e.,  a 
live  range  lr{y)  that  spans  R  with  no  definition  or  use  of  y  in  R.  From  a  global  perspective, 
the  choice  of  which  live  ranges  are  the  best  ones  to  spill  cannot  be  determined  at  this  point 
in  the  algorithm.  Thus  the  decision  on  which  live  ranges  to  spill  is  delayed  until  more 
global  knowledge  about  the  reference  patterns  is  available;  this  phase  determines  only  how 
many  transparent  live  ranges  need  to  be  spilled  within  each  region.  The  next  phase,  graph 
fusion,  determines  which  live  ranges  are  the  best  ones  to  spill.  This  technique  is  referred  to 
as  delayed  spilling  and  is  discussed  in  more  detail  in  Section  4.4.  There,  we  also  describe 
what  to  do  if  the  compiler  must  spill  more  live  ranges  than  there  are  transparent  ones. 
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4A.3  Graph  fusion 

The  graph  fusion  phase  takes  the  sequence  of  control  flow  edges  determined  by  the  region 
formation  phase  and  fuses  interference  graphs  along  each  edge.  Graph  fusion  is  based  on  a 
pov/erfulfusion  operator  that  maintains  the  invariant  that  the  resulting  interference  graph  is 
simplifiable  (i.e.,  can  be  simplified).  Live-range  splitting  decisions  are  made  by  the  fusion 
operator:  if  fusing  two  graphs  Gux  and  along  an  edge  E  results  in  an  interference 
graph  that  cannot  be  simplified,  then  one  or  more  live  ranges  that  span  E  are  split.  Only 
the  live  ranges  that  span  E  need  to  be  considered,  because  in  the  worst  case,  splitting  all 
such  live  ranges  partitions  the  graph  back  to  the  two  original  graphs,  Gri  and  Gr^, 

both  of  which  are  simplifiable.  At  the  end  of  the  graph  merging  phase,  we  are  left  with  one 
simplifiable  interference  graph;  we  know  how  many  live  ranges  to  spill  for  each  region  and 
where  to  place  the  shuffle  code,  but  no  physical  registers  are  committed  to  any  live  range. 
The  simplifiability  invariant  allows  us  to  avoid  making  any  coloring  decisions  prematurely 
during  the  graph  merging  phase.  For  the  example  of  Figure  2.2,  assume  that  we  choose 
loops  as  regions.  There  is  no  coloring  assignment  decision  for  any  live  range  while  graphs 
of  the  loop  region  are  fused.  All  we  know  is  that  the  interference  graph  of  the  loop  region 
is  colorable.  The  actual  coloring  decision  is  deferred  till  the  color-assignment  phase. 

As  more  graphs  are  fused,  the  delayed  spilling  mechanism  gradually  spills  live  ranges. 
The  net  effect  of  combining  splitting  with  delayed  spilling  is  that  the  register  allocator  may 
spill  only  those  segments  of  a  virtual  register’s  live  range  that  do  not  contain  references  but 
span  regions  of  high  register  pressure. 

4.1.4  Color  assignment 

In  this  phase,  physical  registers  are  assigned  to  live  ranges.  The  simplifiability  invariant 
guarantees  that  once  all  interference  graphs  have  been  fused,  the  resulting  interference 
graph  is  colorable  and  out-of-color  spilling  or  splitting  decisions  have  already  been  made. 
Maintaining  the  simplifiability  invariant  is  necessary,  otherwise  we  are  back  to  the  original 
problem  of  making  splitting  and  spilling  decisions  based  on  a  whole  interference  graph. 
There  exist  further  opportunities  for  improving  the  code  in  this  phase:  biased  coloring 
[7]  may  eliminate  shuffle  code,  optimistic  coloring  [9]  may  assign  registers  to  live  ranges 
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that  have  been  spilled  by  simplification,  and  the  enhancements  discussed  in  Chapter  6  may 
decide  to  spill  live  ranges  to  reduce  call  cost  (call-cost  spilling). 

4,1.5  Properties  of  graph  fusion 

Initially,  each  interference  graph  is  sparser  (i.e.  less  constrained)  than  the  function-wide 
interference  graph  G f  unction  because  each  graph  is  only  a  portion  of  G function  >  As  graph 
fusion  proceeds  further,  more  interference  graphs  are  fused,  resulting  in  denser  interference 
graphs  (i.e.  more  constrained)  than  the  ones  that  are  fused  earlier.  Thus,  the  earlier  an  edge 
is  considered  by  the  fusion  process,  the  less  likely  it  is  that  live  ranges  are  split  along  that 
edge.  Fusing  graphs  based  on  an  edge  ordering  provides  a  nice  property:  if  we  don’t  want 
shuffle  code  on  a  particular  edge,  we  can  fuse  the  interference  graphs  along  that  edge  first. 
Consequently,  the  decision  of  where  to  split  is  prioritized  according  to  the  edge  ordering,  and 
shuffle  code  ends  up  on  less  frequently  executed  edges.  Register  allocation  now  becomes 
an  problem  of  edge  ordering  and  forming  regions.  Compiler  writers  can  determine  the 
size  of  regions  and  the  edge  ordering  to  tune  the  quality  of  register  allocation.  Moreover, 
maintaining  the  simplifiability  invariant  allows  register  allocation  to  avoid  committing  any 
register  binding  decision  during  graph  fusion.  As  a  result,  no  early  binding  is  made. 


4.1.6  Example 

We  illustrate  the  above  steps  with  an  example.  Consider  assigning  registers  to  the  program 
fragment  shown  in  Figure  4.2(a).  Assume  we  have  only  two  physical  registers  (A  =  2)  and 
that  regions  are  basic  blocks.  There  are  three  virtual  registers  (x,  y,  and  z)  in  this  program 
with  initial  live  ranges  indicated  by  the  vertical  bars.  The  interference  graph  of  each  basic 
block  prior  to  fusion  is  depicted  in  Figure  4.2  (b).  Suppose  edges  1  and  3  form  the  most 
frequently  executed  path  and  edges  are  fused  in  the  order  1,  3,  2, 4.  After  we  fuse  graphs 
along  edges  1,  3,  and  2,  we  obtain  the  interference  graph  of  Figure  4.2  (c)  (interference 
graph  nodes  list  live  ranges  that  have  been  fused).  If  we  now  fuse  the  interference  graphs 
along  edge  4,  lr3{y)  and  lr^(y)  are  combined,  since  the  only  live  range  spanning  edge  4  is 
lr{y).  However,  graph  fusion  along  edge  4  makes  the  graph  unsimplifiable  (clique  of  size 
3),  so  the  algorithm  undoes  combining  Ir^iy)  and  /r4(y).  lr{y)  is  effectively  split  at  the 
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less  frequently  executed  point  in  this  control  flow  graph.  The  result  is  that  all  values  can  be 
kept  in  registers  with  an  additional  register  move  instruction  at  the  end  of  block  D^. 


4.2  Fusion  operation 

This  section  describes  the  essential  core  of  the  fusion^based  approach, /wi’/on  operation.  The 
fusion  operation  fuses  two  interference  graphs  into  one  and  maintains  the  simplifiability 
invariant.  Figure  4.3  depicts  several  important  steps  of  the  fusion  operation:  finding 
live  ranges  that  span  the  edge  E,  unioning  non-splitable  live  ranges,  merging  cliques, 
taking  a  snapshot  of  the  interference  graph,  unioning  splitable  live  ranges,  and  maintaining 
simplifiability. 


Figure  4.3:  Fusion  operation. 


4.2.  FUSION  OPERATION 


49 


4.2.1  Finding  spanning-E  live  ranges 

For  each  basic  block  i?,  two  sets  of  live  ranges  are  maintained:  RcachIn{D)  mdLivcOut{D) 
Given  a  live  range  lr{x),  lr(x)  G  RcachIn{D)  if  Ir^  is  not  defined  in  B  (definitions  of  Ir:^ 
reach  the  entry  of  D),  and  lr{x)  G  LivcOut{D)  if  /r(a:)  is  live  at  the  exit  of  B.  Consider 
two  basic  blocks  Bi  and  Bi  that  are  in  two  regions  Ri  and  R2,  and  connected  by  an  edge 
E  =  (Z?i,  B2)  in  the  control  flow  graph.  The  Rcachin  and  LivcOut  sets  indicate  which 
live  ranges  span  edge  E,  That  is,  if  lri{x)  G  LivcOut{Bi)  and  lr2{x)  G  RcachIn{B2)j 
one  definition  of  lr2{x)  comes  from  lri{x)  (there  may  be  other  definitions  reaching  lr2{x) 
from  other  paths  to  B2).  To  facilitate  applying  the  fusion  operation  along  edge  E,  we  define 
an  internal  structure  called  span-c  that  is  composed  of  two  live-range  pointers,  from  and 
to;  "from"  points  to  /r  1(0:)  and  "to"  points  to  /r2(a:). 

The  first  step  of  the  fusion  operation  traverses  LivcOut{Bi)  and  RcachIn{B2)  to  find 
out  which  live  ranges  span  edge  E.  The  fusion  operation  attempts  to  coalesce  the  live 
range  segments  lri{x)  and  /r2(a:)  in  the  interference  graph  shuffle  code 

is  needed  for  x  along  E,  This  step  returns  a  working  list,  w,  of  span^c. 

The  example  of  Figure  4.4  shows  the  relationship  among  the  structures  of  span-c, 
Rcachin,  and  LivcOut.  LivcOut(Bi)  has  three  live  ranges,  Irfx),  lri{y),  and  Irfz) 
because  the  virtual  registers,  x,y,  and  are  defined  in  Z?2  and  their  values  are  live  at  the 
exit  of  B\.  RcachIn(B2)  contains  lr2{x)  and  lr2{y)  because  x  and  y  are  defined  in  Bi. 
The  working  list  oj  for  edge  E  records  that  live  ranges  of  x  and  y  span  E. 


Figure  4.4;  Example  of  finding  spanning-E  live  ranges. 
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lri{x) 

lr2{x) 

Spilled  Transparent  Non-transparent 

Spilled 

Spilled  Spill  (Section  4.4)  Split  (Section  4.4) 

Non-splitable  Non-splitable  Splitable 

Transparent 

Coalesced,  transparent  Coalesced,  non-transparent 
Non-splitable  Splitable 

Non-transparent 

Coalesced,  non-transparent 
Splitable 

Figure  4.5:  Different  possibilities  when  coalescing  Iri(x)  and  lr2{x). 


4.2,2  Unioning  non-splitable  live  ranges 


There  are  two  places  (unioning  non-splitable  and  splitable  live  ranges  as  shown  in  Figure 
4.3)  where  the  fusion  operation  combines  live  ranges  that  span  edge  E,  Basically,  both 
steps  combine  ’’from”  and  “to"  live  ranges  of  span>e  into  one.  A  live  range  segment  /r,  can 
be  in  one  of  these  three  states:  it  has  been  spilled,  it  is  represented  in  Gui  and  transparent, 
or  it  is  represented  in  Gr.  and  non-transparent  (i.e.,  there  is  areference).  If  both  segments, 
“from"  and  "to",  are  spilled,  no  shuffle  code  is  needed.  If  lri{x)  is  spilled  and  lr2{x)  is 
transparent,  we  always  spill  the  combined  live  range  Irnix)  in  eliminate  the 

shuffle  code  that  is  needed  otherwise.  If  lri{x)  is  spilled  and  lr2{y)  is  non-transparent,  we 
have  a  split  point.  Otherwise  the  segments  are  coalesced,  although  if  the  new  interference 
graph  is  not  simplifiable,  coalescing  may  be  suppressed  (i.e.,  a  split).  If  both  lri{x)  and 
lr2(x)  are  transparent,  then  Irnix)  in  non-splitable  because  spilling  those  live 

ranges  do  not  introduce  spill  cost.  Figure  4.5  enumerates  all  possible  cases  of  combining 
two  live  ranges.  Empty  entries  in  this  table  are  symmetric  cases. 

This  step  traverses  the  a;  list  and  coalesces  the  live  ranges  whose  combined  live  ranges 
are  not  splitable.  Live  ranges  that  are  coalesced  in  this  step  are  removed  from  a;.  Combining 
two  live  ranges  "from"  and  "to"  consists  of  three  major  steps,  unioning  two  live  ranges, 
merging  interference  edges,  and  propagating  attributes  as  shown  in  Figure  4.6.  The  small 
diamonds  in  the  figure  are  condition  checkers.  For  instance,  the  checker  on  the  left-hand 
side  checks  if  the  combined  live  range  of  the  current  span_e  is  splitable  or  not. 
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Live  Ranges 
Spanning-E 

- 

-► 

Merging 

Cliques 

- 

Snap-shot 

- 

Unioning 
Splitabie  ir 

- 

Maintaining 

Simplifiability 

- 

— 

Figure  4.6:  Structure  of  unioning  2  non-splitable  live  ranges. 


Unioning  two  live  ranges 

How  we  union  two  live  ranges  depends  entirely  on  the  data  structures  of  live  range  and 
instruction.  A  typical  machine  instruction  is  composed  of  three  parts:  opcode,  source, 
and  destination  operands.  Prior  to  register  allocation,  operands  may  access  virtual  registers 
because  the  code-selection  phase  of  cmcc  selects  code  using  virtual  registers.  Two  essential 
aspects  are  kept  inside  the  operand  accessing  a  virtual  register,  (1)  a  pointer  pointing  to 
the  virtual  register,  and  (2)  a  pointer  pointing  to  the  corresponding  live  range  of  the  virtual 
register  at  the  instruction.  At  the  time  of  code  emission,  the  code  emitter  then  gets  the 
physical  register  of  the  operand  via  the  live-range  pointer.  During  region  formation  of  the 
graph  construction  phase,  an  interference  graph  is  built  for  each  region  and  the  live-range 
pointers  of  operands  are  set  properly. 

When  two  live  ranges  are  unioned  into  one,  the  operands  having  pointers  pointing  to  the 
two  live  ranges  must  be  able  to  acknowledge  the  change.  That  is,  the  operands  must  access 
the  combined  live  range  as  opposed  to  the  two  old  live  ranges.  One  solution  is  changing  the 
live-range  pointers  of  the  operands  to  point  to  the  combined  one.  However,  updating  the 
live-range  pointers  is  not  elegant  because  it  is  very  inefficient  to  update  pointers  every  time 
when  two  live  ranges  are  unioned.  Another  way  to  acknowledge  that  two  live  ranges  are 
unioned  is  through  manipulating  the  data  structure  of  live  ranges.  The  algorithm,  disjoint- 
set  forests  [19],  with  union-by-rank  and  path-compression  heuristics  is  used  for  unioning 
two  live  ranges.  Each  disjoint  set  represents  one  live  range.  Each  live  range  contains  one 
parent  pointer  and  rank  for  the  disjoint  set.  The  live  range  whose  parent  pointer  points  to 
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itself  (root)  is  the  representative  of  all  live  ranges  that  are  in  the  same  set.  Namely,  the  root 
live  range  keeps  all  information  regarding  color  assignment,  interference, ...,  etc.  All  other 
live  ranges  of  the  same  set  need  to  follow  their  parent  pointers  to  get  to  the  root  live  range  to 
retrieve  all  information.  As  a  result,  the  live-range  pointers  of  operands  are  never  required 
to  be  updated  using  the  disjoint-set  algorithm. 

Figure  4.7(a)  illustrates  an  example  of  unioning  two  live  ranges  of  a:  using  the  disjoint- 
set  algorithm.  Each  shaded  region  represents  one  live  range.  One  live  range  comprises 
Iri  and  lr2,  and  the  other  consists  of  /rs,  lr4,  and  Irs.  Arrows  indicate  the  parent  pointers. 
The  node  whose  parent  points  to  itself  is  the  root  that  represents  the  live  range,  e.g.  Iri 
and  /r3.  Unioning  the  two  live  ranges  is  done  by  simply  changing  /ra’s  parent  pointer  to 
point  to  Iri,  Iri  is  now  the  one  who  represents  the  union  of  the  two  live  ranges  as  shown  in 
Figure  4.7(b).  Finding  the  root  live  range  for  Ir^  traverses  the  parent  pointers  to  get  to  the 
root.  The  traversing  path  from  Ir^  to  the  root  is  compressed  by  changing  the  parent  pointer 
of  each  node  on  the  path  to  point  directly  to  the  root  (the  dotted  line  illustrated  in  Figure 
4.7(c)).  Changing  the  parent  pointer  of  each  node  on  the  traversing  path  is  known  as  path 
compression  (the  path  from  lr4.  to  /rj  is  now  shorter  than  the  one  of  Figure  4.7(b)). 


Figure  4.7:  Unioning  two  live  ranges  using  disjoint- set. 


There  is  one  price  to  be  paid  for  using  the  disjoint-set  algorithm —  superfluous  memory. 
The  size  of  the  data  structure  for  a  live  range  is  large.  The  cmcc  compiler  is  developed 
on  DEC  Alpha  machines  on  which  each  pointer  requires  64  bits  (8  bytes).  Using  virtual 
functions  in  C*H-  costs  each  object  whose  class  declares  virtual  functions  a  pointer  that 
points  to  the  virtual  function  table  for  that  class  [59, 49].  The  size  of  our  data  structure  for 
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a  live  range  is  184  bytes.  What  a  non-root  live  range  needs  is  the  parent  pointer  to  proceed 
to  the  root  to  retrieve  live-range  information.  The  other  fields  of  the  structure  are  never 
used  and  therefore  wasted.  To  alleviate  the  problem,  another  structure,  live-range  union, 
is  created.  This  structure  sits  between  operands  and  live  ranges.  Now,  an  operand  that 
accesses  a  virtual  register  has  a  live-range  union  pointer  rather  than  a  live-range  pointer. 
A  live-range  union  contains  a  parent  pointer,  rank  as  well  as  a  live-range  pointer.  Only 
root  live-range  unions  retain  live  ranges.  As  two  live  ranges  are  unioned,  the  combined 
live  range,  Irn,  of  Iri  and  Iri  is  either  Iri  or  /r2.  The  other  one  can  later  be  destroyed  and 
recycled.  Live-range  union  allows  the  internal  representation  of  register  allocation  to  use 
memory  not  more  than  necessary. 

Figure  4.8  sketches  the  structures  for  an  instruction,  a  live-range  union,  and  a  live  range. 
Only  the  root  live-range  union,  which  is  shaded,  retains  the  live-range  information.  The 
live  ranges  of  the  other  two  live-range  unions  have  been  freed  up  in  the  process  of  unioning 
live  ranges.  To  access  the  physical  register  that  the  srcl  operand  uses,  we  need  to  go 
through  lr..ptr  to  get  to  its  corresponding  live-range  union,  follow  the  parent  pointer  of 
the  live-range  union  up  to  the  root,  and  then  query  the  live  range  to  return  the  physical 
register  (see  Figure  4.8). 


Figure  4.8:  Live-range  union. 
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Merging  interference  edges 


The  two  live  ranges  have  their  own  interference  edges,  and  v?(/r2),  prior  to  uniting. 
In  this  step,  and  'p(lr2)  are  merged.  If  Irn  is  Iru  then  each  interference  edge, 

[/r2,  X],  of  Iri  is  updated  by  substituting  Iri  with  Irn  ([Irn,  A"])  where  A"  €  ncighbor{lr2). 
Likewise,  each  edge  of  is  also  updated  in  a  similar  fashion  if  Iru  is  It2^  Duplicate 
edges  occur  when  lr\  and  lr2  interfere  with  the  same  live  range  prior  to  uniting.  In  such  a 
case,  duplicate  edges  are  not  added  to  the  interference  graph  and  therefore  freed  up. 

The  example  in  Figure  4.9  illustrates  merging  interference  graphs.  The  interference 
graph  prior  to  unioning  Iri  and  lr2  is  given  in  Figure  4.9(a),  has  two  edges: 
and  [Iri.lry].  (p{lr2)  has  also  two  edges:  [lr2,lry]  and  [lr2jlrz]^  Both  Iri  and  It2  interfere 
with/ry.  After  we  combine /ri  and/r2into/ri2,  we  update '^(/n)  and  j^(/r2).  The  two  edges 
of  'p{lr\)  now  become  [/ri2,/ra;]  and  [/ri2,^ry];  the  two  edges  of  '^(It2)  become  [/ri2,/ry] 
and  [Iru^lr.^].  Merging  and  v^(/r2)  eliminates  one  duplicate  edge,  [/ri2,/ry].  The 
resulting  interference  graph  is  depicted  in  Figure  4.9(b). 


Figure  4.9:  Example  of  merging  interference  edges. 


Propagating  attributes 

Each  live  range  has  several  attributes  that  are  propagated  during  graph  fusion:  caller jcosU 
spilLcost,  has-def,  and  area.  The  callerxost  attribute  is  the  estimated  cost  of  assigning 
a  caller-saved  register  to  the  live  range — the  number  of  saves  and  restores  that  must  be 
executed  around  function  calls  if  the  live  range  is  assigned  a  caller- saved  register.  The  spill- 
cost  attribute  is  the  estimated  spill  cost  of  a  live  range.  The  caller-save  cost  and  spill-cost 
properties  are  used  during  color  assignment  to  decide  whether  a  live  range  is  assigned  a 
callee-save  register,  a  caller-save  register,  or  whether  it  is  more  profitable  to  spill  the  live 
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range.  The  has.def  property  is  true  for  a  live  range  lr{x)  if  there  is  a  definition  of  x  within 
It{x).  The  has-def  attribute  indicates  whether  a:*s  value  needs  to  be  stored  to  memory  at 
exits  from /r (a:);  if  a  live  range  does  not  modify  the  value  of  a  virtual  register,  then  no  stores 
are  required  at  exits  from  the  live  range.  The  area  attribute  indicates  the  size  of  the  live 
range  and  is  used  for  making  out-of-color  spilling  decisions  [5]. 

These  attributes  are  propagated  as  follows: 

•  hasAcf{lr\'2)  =  has-dcf{lr\)  V  hasjdcf[lri) 

•  caller -Cost{lr\^  =  caller  j:.ost{lr\)  +  callcr^cost{lr2) 

•  spilLcost{lr\2)  =  spill -cost  (Iri)  -f-  spill -cost  [Ir  2) 

•  arca{lr\2)  =  arca(lr\)  +  arca{lr2) 

4.2.3  Merging  cliques 

The  clique  summary  nodes  (Section  4.4)  are  used  to  delay  spilling  decisions.  Multiple 
cliques  may  exist  on  both  ends  of  the  edge  E  as  more  graphs  are  fused.  The  cliques- 
merging  step  deals  with  how  to  combine  cliques.  Details  are  described  in  Section  4.4. 

4.2.4  Taking  a  snapshot  of  the  interference  graph 

As  mentioned  earlier,  combined  live  ranges  may  be  split  to  maintain  the  simplifiability 
invariant.  As  two  live  ranges,  lr\  and  lr2,  are  combined  into  Irn,  their  interference  edges, 
V?(/ri)  and  '>p(lr2),  are  merged.  Duplicate  edges  may  arise  and  are  thrown  away.  To  undo 
combining,  the  register  allocator  must  retain  the  original  interference  graph.  Otherwise,  it 
is  hard  and  very  inefficient  to  recompute  interference  information.  This  step  takes  a  partial 
snapshot  of  the  current  interference  graph  before  combining  any  splitable  live  ranges. 
Because  only  splitable  live  ranges  need  their  original  interference  information,  the  origial 
interference  edges  of  these  live  ranges  are  stored.  Non-splitable  live  ranges  were  already 
removed  from  a;  in  the  step  of  unioning  non-splitable  live  ranges.  Now  all  live  ranges  of  u> 
are  splitable.  This  step,  therefore,  stores  the  interference  edges  of  the  live  ranges  of  a?. 
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4.2.5  Unioning  splitable  live  ranges 

This  step  traverses  a;  and  combines  each  pair  of  live  ranges  (all  resulting  live  ranges  are 
splitable).  Unioning  two  splitable  live  ranges,  Irj  and  Irz  requires  two  operations  that  are 
similar  but  not  identical  to  unioning  non-splitable  live  ranges.  First,  the  two  live  ranges  are 
combined  into  one  live  range,  /ri2,  as  described  in  Section  4.2.2.  Nevertheless,  the  slight 
distinction  is  that  there  is  no  path  compression  between  this  step  and  the  step  that  commits 
live-range  unioning  decisions  in  Figure  4.10.  The  second  operation  merges  interference 
edges  without  freeing  up  duplicate  edges.  There  is  another  noteworthy  point — attributes 
of  live  ranges  are  not  propagated  at  this  point  of  time.  All  differences  of  unioning  non- 
splitable  and  splitable  live  ranges  originate  from  the  fact  that  the  latter  wants  to  retain  the  old 
interference  information  so  as  to  split  combined  live  ranges  easily,  and  the  former  doesn’t. 
The  task  of  deleting  duplicate  edges  and  propagating  attributes  of  splitable  live  ranges  is 
deferred  till  the  simplifiability  invariant  is  satisfied. 

We  avoid  compressing  paths  because  updating  the  parent  pointer  causes  a  serious 
problem  when  we  want  to  split  a  combined  live  range.  Consider  the  example  of  Figure  4.7. 
Presume  that  path  compression  happens  prior  to  splitting  combined  live  ranges.  The  parent 
pointer  of  lu  may  have  been  changed  to  Iri  (Figure  4.7(c)).  It  is  now  very  hard  to  split  the 
combined  live  range  because  we  have  no  idea  if  lr4  belongs  to  Iri  or  Ir^  initially,  unless 
there  is  some  extra  bookkeeping.  On  the  contrary,  if  we  don’t  allow  path  compression  to 
happen,  then  splitting  the  combined  live  range  can  be  done  by  simply  changing  Ir^'s  parent 
to  point  to  itself. 


4.2.6  Maintaining  the  simplifiability  (colorability)  invariant 

The  step  that  maintains  simplifiability  ensures  that  the  invariant  holds  by  performing  sim¬ 
plification  on  Ghiuh2-  The  algorithm  described  in  [52]  is  used  to  deal  with  register  pairs. 
This  step  consists  of  four  major  components  as  depicted  in  Figure  4.10:  computing  degree, 
removing  a  LS-N  node  (degree  less  than  N),  lowering  register  pressure,  and  committing 
the  live-range  unioning  decision.  The  shaded  region  simplifies  the  interference  graph  until 
an  empty  graph  is  obtained. 
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Figure  4.10:  Maintaining  simplifiability. 


Computing  degree 

There  are  two  working  lists,  LS^N  and  GE..N,  used  by  the  simplification.  The  LS^N  list 
keeps  live  ranges  whose  degrees  are  less  than  N  (Max  colors).  The  GE-N  list  contains  live 
ranges  whose  degrees  are  greater  than  or  equal  to  N.  The  computing  degree  step  computes 
the  degree  of  each  live  range  of  the  interference  graph  and  initializes  the  two  lists. 


Removing  a  LS_N  node 

Again  the  small  diamonds  of  Figure  4.10  are  condition  checkers.  The  checker  between  the 
computing-degree  and  removing-LSJN  steps  checks  if  there  are  live  ranges  on  the  LS-N 
list.  If  LS-N  is  not  empty,  then  the  removing-LS  JSl  step  removes  one  live  range  from  the 
beginning  of  the  LS-N  list  and  then  removes  the  live  range  from  the  graph.  The  degrees  of 
the  live  range’s  neighbors  are  decremented.  If  any  neighbor  that  was  constrained  becomes 
unconstrained,  the  neighbor  is  removed  from  the  GE-N  list  and  inserted  into  the  LS-N 
list. 

Lowering  register  pressure 

If  all  remaining  live  ranges  are  on  the  GE-N  list  (and  LS-N  list  is  empty),  then  simpli¬ 
fication  of  blocks.  In  other  words,  fusing  the  interference  graphs  of  G^  and  Gr^ 
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along  edge  E  induces  high  register  pressure.  Some  actions  are  required  to  lower  the  register 
pressure.  In  a  first  attempt,  we  find  a  transparent  live  range  to  spill,  instead  of  splitting 
live  ranges.  Spilling  a  transparent  live  range  incurs  no  spill  cost  (but  implies  shuffle  cost), 
and  the  delayed  spilling  technique  may  lower  the  incurred  shuffle  cost,  i.e.  push  shuffle 
code  onto  low  frequently  executed  edges.  After  spilling  a  transparent  live  range,  degrees 
of  its  neighbors  are  decreased  correspondingly.  If  there  is  no  transparent  live  range,  then 
a  coalesced  (splitable)  live  range  Irn  is  chosen  from  the  remaining  constrained  nodes  to 
split.  Iri2  will  be  split  along  E  back  into  Ir^  and  The  choice  of  which  live  range  to 
split  is  solely  based  on  a  node’s  degree,  because  the  shuffle  costs  (move)  at  edge  E  for  all 
splitable  live  ranges  are  the  same.  After  splitting  Iruj  if  Iri  {lr2)  becomes  unconstrained, 
Iri  (lr2)  is  inserted  into  the  LS.N  list.  The  degree  of  the  interference  graph  may  be  lowered 
because  of  splitting,  allowing  simplification  to  proceed  from  the  point  where  it  blocked. 
If  simplification  blocks  again,  additional  live  ranges  are  split.  Experimental  results  show 
that  choosing  the  splitable  live  range  that  has  the  smallest  degree  incurs  fewer  splittings 
than  choosing  the  highest  degree.  The  reason  is  that  Iri  and  lr2  are  more  likely  to  become 
unconstrained  using  the  heuristic  of  choosing  the  smallest  degree.  Since  the  original  graphs 
Ghi  and  Gr^  are  simplifiable,  then  at  worst  all  live  ranges  that  span  E  are  split.  Figure 
4.11  depicts  the  flow  of  the  steps  to  lower  register  pressure.  The  splitting  step  may  generate 
critical  nodes  while  splitting  a  combined  live  range.  Section  4.2.7  elucidates  how  splitting 
live  ranges  generates  critical  nodes.  If  a  critical  node  arises,  then  simplification  is  reiterated 
from  the  computing-degree  step  as  illustrated  in  Figure  4.10. 

Splitting  a  splitable  live  range  Ivn  consists  of  three  steps  (in  the  shaded  region).  Union¬ 
ing  the  two  live  ranges,  lr\  and  lr2,  causes  one  of  them  to  become  a  child  of  the  other  using 
the  disjoint-set  algorithm.  The  first  step  removes  Irn  from  the  GE.N  list  and  makes  the 
child  to  be  a  root  by  changing  the  child’s  parent  pointer  to  point  to  itself.  The  second  step 
removes  all  edges  of  ^^(/rn),  retrieves  the  two  sets  of  original  interference  edges, 
and  '^(  lr2) ,  that  were  recorded  in  the  taking- snapshot  step.  If  the  child  is  Irt  then  each  edge 
of  is  updated  by  changing  [/r^,  A^]  to  [Iru  A']  where  A^  G  ncighbor{lri)  (reversing 
what  had  been  done  in  the  merging-edges  step  of  Section  4.2.5).  Each  edge  of  v?(/r2)  is 
updated  similarly  if  the  child  is  Iri.  and  '^ilr2)  are  hereby  added  back  to  the  graph 

again  (duplicate  edges  may  still  exist  and  are  not  put  back  to  the  graph).  The  third  step 
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Figure  4.1 1:  Lower  register  pressure. 


recomputes  the  degrees  of  Iri  and  /r2.  If  Iri  (lr2)  is  unconstrained,  lr\  {lr2)  is  inserted  into 
the  LS-N  list.  Otherwise,  they  are  inserted  into  the  GE^N  list. 

If  none  of  the  live  ranges  of  the  GE-N  list  is  splitable,  then  simplification  must  have 
produced  at  least  one  critical  node.  Otherwise,  simplification  should  proceed  without 
blocking.  Hence,  all  nodes  are  simply  removed  from  the  graph,  and  simplification  is 
repeated  from  the  computing-degree  step. 


Committing  live-range  unioning  decision 

By  the  time  we  get  to  the  "committing  unioning  decision"  step,  the  simplifiability  constraint 
is  satisfied  and  all  necessary  spilling  (transparent  live  ranges)  and  splitting  (coalesced  live 
ranges)  decisions  are  made.  Some  live  ranges  are  coalesced  and  some  live-range  unions  are 
suppressed.  The  step  now  commits  all  coalesced  live  ranges.  Namely,  those  coalesced  live 
ranges  will  never  be  split  at  edge  E.  Thus,  the  attributes  of  the  coalesced  live  ranges  are 
then  propagated.  Duplicate  interference  edges  can  now  be  eliminated.  Furthermore,  parent 
pointers  of  live  ranges  are  allowed  to  be  updated  freely  from  now  on. 
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4.2  J  Critical  node 

When  two  live  ranges  Iri  and  lr2  are  combined  into  Irn,  the  degrees  of  their  neighbors 
can  only  decrease  monotonously  (the  degree  will  never  increase).  If  one  live  range  Ir^ 
interferes  with  lr\  and  Iri,  coalescing  Iri  and  Ir^  produces  one  duplicate  edge.  The 
duplicate  edge  is  then  removed  from  the  graph,  as  a  result,  the  degree  of  is  decreased 
by  1.  Conversely,  splitting  a  coalesced  live  range,  Ivn,  increases  its  neighbors’  degrees 
monotonously  because  the  removed  duplicate  edge  is  added  back  to  the  interference  graph 
(the  degree  of  Ivj:  is  increased  by  1).  One  problem  arises,  due  to  this  property,  when  splitting 
is  taking  place  during  simplification  (Figure  4.1 1).  One  live  range  is  removed  (simplified) 
from  the  interference  graph  only  under  the  condition  that  the  degree  of  the  live  range  is 
less  than  N.  Again,  the  main  observation  is  that  the  register  allocator  is  able  to  find  a 
legal  color  for  the  live  range  regardless  how  colors  are  assigned  to  the  residual  graph.  The 
observation,  nonetheless,  may  no  longer  hold  once  coalesced  live  ranges  are  split  because 
the  degree  of  an  already-simplified  live  range  may  increase  and  become  constrained.  Those 
already-simplified  live  ranges  that  become  constrained  are  called  critical  nodes.  A  critical 
node  can  happen  only  when  live  ranges  are  split. 

Detecting  critical  nodes  is  done  by  keeping  track  of  the  degrees  of  already-simplified 
live  ranges.  The  notation  cr{lr)  denotes  the  degree  of  Ir  at  the  time  when  Ir  is  removed 
from  the  graph  during  simplification.  After  Ir  is  removed  from  the  graph,  a{lr)  (cr{lr)  <  N 
at  this  point  of  time)  is  then  frozen  and  not  decreased  when  subsequent  live  ranges  are 
removed  from  the  graph.  At  the  time  of  splitting  Irn  back  to  Iri  and  lr2,  a{lr)  is  increased 
by  1  if  both  Iri  and  lr2  interfere  with  Ir.  Whenever  cr{lr)  is  equal  to  N,  then  Ir  becomes  a 
critical  node  (the  condition  cr{lr)  <  N  is  violated). 

The  simplification  step  of  Figure  4. 10  never  tries  to  backtrack  when  critical  nodes  arise 
(because  it  is  hard).  Instead,  simplification  continues  until  an  empty  graph  is  obtained.  If 
critical  nodes  arise,  then  simplification  is  reiterated  from  the  computing  degree  step  to  make 
sure  that  the  resulting  graph  can  still  be  simplified  without  blocking.  The  nextpass  simplifies 
the  interference  graph  on  which  some  live  range  are  already  split  at  E.  Simplification  will 
terminate  within  a  fixed  number  of  iterations  because  a  critical  node  can  happen  only  when 
live  ranges  are  split,  and  there  is  only  a  fixed  number  of  splitable  live  ranges.  In  reality, 
critical  nodes  happen  rarely.  As  I  have  observed  so  far,  one  more  iteration  of  simplification 
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is  sufficient  to  preserve  the  simplifiability  invariant.  One  implementation  could  curtail  the 
chance  of  generating  critical  nodes.  Always  simplifying  an  unconstrained  live  range  that 
has  the  smallest  degree  lowers,  in  general,  the  degrees  of  live  ranges  at  the  time  when  they 
are  simplified.  Because  critical  nodes  occur  rarely,  the  technique  is  not  implemented. 

4.2,8  Example  of  critical  node 

The  example  of  Figure  4.12  shows  how  splitting  live  ranges  produces  a  critical  node  and  why 
re-performing  simplification  is  essential.  Figure  4.12  (a)  is  the  interference  graph  before 
graph  fusion.  With  N=3,  the  interference  graph  can  be  simplified.  One  legitimate  sequence 
of  removing  nodes  from  the  graph  during  simplification  is  lr\{x),  lr2{x),  lri{y),  lr2{y), 
lr{w),  lr(z),  and  lr{t).  Assume  that  the  fusion  operation  combines  lri{x)  with  lr2{x)  first 
and  lri{y)  with  lr2{y)  next.  Figure  4.12(b)  portrays  the  resulting  graph  after  lri{x)  and 
lr2{x)  are  combined  into  lrn{x).  As  lri{y)  and  lr2{y)  are  combined,  two  duplicate  edges, 
Uri2{y),lri2{x)]  and  from  the  graph  as  depicted  in  Figure 

4.12(c). 

There  are  five  live  ranges  in  the  final  graph  (Figure  4.12(c)).  /ri2(y),  lr{w),  lr{z), 
and  lr{t)  form  a  clique  of  size  4.  Light  shaded  live  ranges  and  dotted  interference  edges 
indicate  that  the  live  ranges  and  edges  are  simplified  and  removed  from  the  graph.  With 
iV  =  3,  lrx2{^)  can  be  simplified  because  the  degree  of  Irnix)  is  2.  Then  simplification  of 
the  remaining  graph  blocks  because  every  node  has  degree  =  3  (shown  in  Figure  4.12(d)). 
Splitting  Irniy)  back  to  lr\(y)  and  lr2{y)  adds  the  two  duplicate  back  to  the  interference 
graph  (shown  in  Figure  4.12  (e)).  Consequently,  the  degree  of  is  increased  by  1; 

becomes  a  critical  node.  The  degrees  of  lri{y)  and  lr2(y)  are  2  after  splitting  /ri2(y) 
(recall  that  /ri2(a:)  is  removed  from  the  graph  already).  Simplification  can  proceed  from 
where  it  blocked  by  simplifying  lri{y)  and  /r2(y),  and  an  empty  graph  is  obtained. 

Even  though  an  empty  graph  is  obtained,  the  simplifiability  invariant  doesn’t  hold  due 
to  the  critical  node  /ri2(a:).  If  we  perform  simplification  on  the  interference  graph  of 
Figure  4.12(e),  we  discover  that  simplification  blocks  and  cannot  remove  any  node  from 
the  graph  because  all  nodes  are  constrained  nodes  (degree  >  3).  When  critical  nodes  arise, 
re-iterating  simplification  splits  live  ranges  as  necessary  so  as  to  maintain  the  simplifiability 
constraint.  That  is,  if  simplification  blocks,  splitable  live  ranges  are  chosen  to  split.  Irnix) 
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Figure  4.12:  Critical  node  (N=3). 

is  split  in  the  second  pass  of  simplification  on  the  graph  of  Figure  4.12(e).  As  a  result,  the 
maintaining-simplifi ability  step  splits  the  interference  graph  of  Figure  4.12(c)  back  to  the 
original  graph  (Figure  4.12(a)). 


4.3  Modeling  different  splitting  approaches 

Before  we  describe  how  the  fusion-style  approach  models  various  splitting  approaches,  we 
need  to  know  how  for  a  region  R  is  constructed  because  different  approaches  can  be 
distinguished  based  on  the  strategy  used  to  select  regions. 
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4J.1  Constructing  Gji 

Constructing  Gr,  like  graph  fusion,  also  relies  on  the  fusion  operation  to  glue  small 
interference  graphs  into  one.  Namely,  both  constructing  Gr  and  graph  fusion  use  the  same 
framework  of  the  fusion  operation.  The  fusion  operation  that  fuses  the  interference  graphs 
of  two  regions  is  called  weak  fusion  (presented  in  Section  4.2)  because  splitting  live  ranges 
is  allowed  during  graph  fusion.  The  fusion  operation  fusing  two  graphs  that  belong  to 
one  region  is  called  strong  fusion  because  splitting  live  ranges  is  forbidden.  A  region  is 
composed  of  a  collection  of  basic  blocks  and  set  of  control-flow  edges.  Radges,  that  connect 
blocks  within  the  region.  Gr  is  constructed  in  two  steps.  First,  the  interference  graph  is 
built  for  each  block  in  the  region.  Second,  the  strong  fusion  operation  is  applied  to  every 
edge  of  Redges^  The  two  steps  are  similar  to  the  region  formation  and  graph  fusion  of  Figure 
4.1.  Blocks  are  regions,  and  graphs  of  regions  are  fused  using  the  strongs  rather  than  weak, 
fusion  operation.  Because  we  don’t  maintain  the  simplifiability  invariant  when  we  build 
Gr,  the  graph  simplification  step  of  Figure  4.1  is  not  necessary.  No  spilling  decisions  are 
made  during  constructing  Gr, 


Figure  4. 13:  Strong  fusion  operation. 


Figure  4.13  depicts  the  structure  of  the  strong  fusion  operation.  The  three  steps  high¬ 
lighted  by  dotted  rectangle  that  deal  with  splitting  live  ranges  are  just  nop:  taking  a  snapshot 
of  the  interference  graph,  unioning  splitable  live  ranges,  and  maintaining  the  simplifiability 
invariant.  All  live  ranges  that  span  E  e  Radges  are  non-splitable. 
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4.3.2  Tera 

The  Tera  approach  can  be  modeled  by  using  loops  as  regions.  The  edge  ordering  for 
the  control-flow  edges  that  connect  regions  are  ordered  based  on  the  estimated  execution 
frequency  using  loop  depth— the  edges  in  the  inner  loops  appear  in  front  of  the  edges  in 
the  outer  loops.  We  thereby  perform  graph  fusion  based  on  the  edge  ordering.  The  way 
that  we  model  the  Tera  approach  is  not  identical  to  what  was  described  in  [12]  because 
fusion- style  coloring  does  not  construct  the  tile  tree  and  perform  register  allocation  tile  by 
tile.  However,  both  approaches  have  the  same  perspective  to  register  allocation.  That  is,  the 
inner  loops  have  more  freedom  in  terms  of  assigning  registers.  The  fusion-based  approach 
should  outperform  the  Tera  approach  for  two  reasons.  First,  no  premature  binding  decisions 
have  been  made  in  the  fusion-based  approach.  Second,  the  Tera  approach  does  not  take  into 
account  the  edges  that  connect  a  parent  and  a  child  tile  to  prevent  live  ranges  from  being 
split  on  one  of  the  edges  that  is  executed  more  often  than  the  others.  The  fusion-based 
approach  can  prioritize  those  edges  based  on  profile  information  so  as  to  avoid  splitting  live 
ranges  at  frequently  executed  edges. 


4.3.3  PDG 

The  RAP  compiler  performs  register  allocation  in  a  hierarchical  fashion  based  on  the  PDG. 
A  region  node  in  the  PDG  is  a  collection  of  predicates  and  statements  that  are  executed  under 
the  same  control  conditions.  A  region  node  R  may  consist  of  subregion  nodes.  To  model 
the  PDG  approach,  we  can  exclude  the  subregion  nodes  from  R  and  consider  as  a  region 
in  fusion-style  coloring.  We  then  traverse  the  PDG  from  leaves  to  the  root  like  the  way 
that  the  RAP  compiler  performs  register  allocation.  During  traversing  the  PDG,  we  order 
control-flow  edges  by  assigning  higher  priorities  to  the  control-flow  edges  that  connect  the 
current  region  to  its  subregions  than  to  the  rest  of  the  control-flow  edges  that  are  not  yet 
assigned  priorities.  The  PDG  approach  modeledby  fusion-style  coloring  should  outperform 
the  conventional  PDG  approach  for  the  same  two  reasons  as  described  in  modeling  the  Tera 
approach.  First,  no  premature  binding  decisions  have  been  made  in  fusion-style  coloring. 
Second,  the  conventional  PDG  approach  does  not  take  into  account  the  execution  frequencies 
of  the  edges  that  connect  a  parent  region  and  a  subregion  whereas  the  fusion-style  approach 
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can  prioritize  the  edges  based  on  their  estimated  execution  frequencies. 

4.3*4  Multiflow 

The  Multiflow  compiler  integrates  scheduling  and  register  allocation.  Scheduling  and 
register  allocation  are  done  on  a  trace  basis.  That  is,  the  compiler  picks  a  trace  and  passes 
the  trace  to  the  code  scheduler;  the  code  scheduler  then  schedules  the  code  of  the  trace 
and  performs  register  allocation.  Because  the  cmcc  compiler  does  not  have  the  framework 
that  integrates  scheduling  and  register  allocation  like  the  Multiflow  compiler,  fusion-style 
coloring  can  model  only  the  register  allocation  part  of  the  Multiflow  approach.  One  key  idea 
of  the  Multiflow  approach  is  that  traces  chosen  first  have  more  freedom  in  using  registers — 
are  not  constrained  by  other  traces  because  other  traces  are  not  yet  compiled.  Fusion- style 
color  has  the  flexibility  to  choose  traces  as  regions,  maintains  the  simplifiability  invariant  to 
delay  color-assignment  decisions,  and  uses  the  delayed  spilling  technique  to  defer  spilling 
decisions.  Therefore,  unlike  the  Multiflow  approach,  traces  are  never  constrained  by  the 
color  binding  decisions  of  other  traces. 

There  are  two  counteractive  aspects  when  register  allocation  chooses  traces  as  regions. 
The  code  scheduler  requires  instructions  to  exploit  ILP.  The  bigger  the  trace,  the  more 
instructions  the  code  scheduler  can  exploit.  However,  the  bigger  the  trace,  the  more  register 
pressure  the  trace  imposes  on  register  allocation.  Once  a  trace  is  picked,  no  live-range 
splitting  could  happen  using  the  Multiflow  approach.  As  a  result,  the  spilling  decisions 
made  by  the  register  allocator  are  all-or-nothing  spilling.  The  fusion-based  approach,  on 
the  other  hand,  has  the  flexibility  to  further  divide  a  trace  into  smaller  regions  and  applies 
graph  fusion  to  combine  the  graphs  of  the  smaller  regions  into  the  interference  graph  of  the 
trace.  Live-range  splitting  is  therefore  possible  within  the  trace. 

4.3.5  SSA 

With  slight  modification,  the  fusion-based  framework  can  model  the  approach  that  uses 
the  SSA  representation  to  determine  splitting  points.  Constructing  Gssa  is  similar  to 
constructing  G/i  where  R  is  the  region  that  covers  the  whole  function.  Splitting  a  live  range 
at  a  ^  node  is  the  same  as  splitting  the  live  range  at  the  entry  edges  of  the  block  that  contains 
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the  6  node  because  6  nodes  are  inserted  into  the  entry  of  blocks  [20].  We  apply  the  strong 
fusion  operation  on  every  control-flow  edge.  During  the  step  of  unioning  non-splitable 
live  ranges  (Figure  4.13),  coalescing  two  live  ranges  is  suppressed  if  the  current  edge  is  a 
<^-node  splitting  point  for  the  two  live  ranges.  The  later  color  assignment  eliminates  shuffle 
moves  using  biased  coloring. 

4.4  Delayed  spilling 

When  a  region  It  needs  M  physical  registers  to  be  colored,  and  M  >  N,  then  some  live 
ranges  must  be  spilled.  Considering  only  local  spill  costs  inside  li,  the  best  spill  choice  is 
a  transparent  live  range  Lu  since  Lt  has  a  high  degree  in  the  interference  graph,  and  the 
cost  of  spilling  Lt  is  zero  inside  Ft  If  we  assume  that  there  are  T  transparent  live  ranges 
in  region  Ft,  and  K  live  ranges  must  be  spilled,  then  there  are  three  cases  that  must  be 
considered: 

/i  <  T:  In  this  case,  the  number  of  transparent  live  ranges  is  more  than  the  number  of 
live  ranges  that  must  be  spilled.  However,  choosing  the  transparent  live  ranges 
for  spilling  should  be  delayed  until  the  compiler  obtains  more  global  information 
about  the  reference  patterns  of  the  transparent  live  ranges.  Searching  the  region’s 
immediate  neighbors  does  not  solve  the  problem  because  transparent  live  ranges  may 
be  transparent  across  many  basic  blocks.  Delayed  spilling  deals  nicely  with  this  case, 
as  explained  below.  A  large  number  of  transparent  live  ranges  is  common  while 
processing  the  high-priority  edges. 

K  =  T:  In  this  case,  the  spill  needs  are  satisfied  by  spilling  all  transparent  live  ranges. 

F{  >  T:  In  this  case,  all  transparent  live  ranges,  as  well  as  A  -  T  live  ranges  with  a 
reference  inside  R,  are  spilled.  The  spilling  decisions  are  selected  by  a  heuristic 
based  on  spill  cost,  area,  and  the  degrees  in  the  interference  graph. 

Although  spilling  a  transparent  live  range  is  very  attractive  (because  it  has  no  spill 
cost),  it  may  not  be  globally  optimal.  Figure  4.14  shows  a  flow  graph  for  which  spilling 
transparent  live  ranges  is  not  optimal.  In  Figure  4.14(a),  each  of  Di  and  Di  has  three  live 
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Figure  4.14:  Non-optimal  spilling  (N=2). 


ranges  simultaneously  live.  Regions  are  basic  blocks.  The  vertical  lines  on  the  right  of  the 
blocks  represent  the  live  ranges  of  the  virtual  registers  x,  y,  and  (a  lightly  shaded  line 
indicates  that  the  live  range  is  transparent).  The  number  next  to  each  control-flow  edge 
is  the  execution  frequency  of  the  edge,  Bi  and  D3  have  the  same  execution  frequency, 
100.  Three  colors  are  needed  to  color  the  interference  graph  of  Di  and  Di.  Since  there  are 
only  two  registers,  we  must  spill  one  live  range  from  each  block.  Figures  4.14(b)  depicts 
the  case  of  spilling  transparent  live  ranges  (spilled  live  ranges  are  drawn  with  a  hatched 
pattern).  lri{x)  midlr2{z)  are  transparent  live  ranges  and  spilled  to  reduce  register  pressure. 
Although  the  two  live  ranges  do  not  have  spill  cost,  they  need  shuffle  code  to  move  values 
between  registers  and  memory.  The  shuffle  cost  is  370  load/store  operations.  The  optimal 
result  (least  overhead  operations)  in  this  example  is  spilling  y’s  live  ranges  (^ri(y)  and 
Iriiy))  as  depicted  in  Figure  4.14(c)  (only  190  operations  of  spill  code). 

I  now  describe  in  more  detail  the  delayed  spilling  technique  used  to  handle  the  case 
where  K  <  T.  Since  all  transparent  live  ranges  conflict  with  each  other,  these  live  ranges 
form  a  clique  in  the  interference  graph.  The  transparent  live  ranges  are  therefore  collected 
into  a  single  clique  summary  node  C  in  the  interference  graph,  as  depicted  in  Figure 
4.15.  The  node  C  contains  an  edge  to  all  other  nodes  in  the  graph,  since  the  transparent  live 
ranges  interfere  with  all  other  live  ranges  in  a  region.  The  clique  summary  node  is  annotated 
with  the  number  of  transparent  live  ranges  that  it  represents,  i.e.,  T{C).  We  record  that 
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il>{C)  =  K  of  the  T  transparent  live  ranges  must  be  spilled,  \vithout  specifying  which  ones. 
The  actual  size  of  the  clique  is  thus  T(C)  -  The  clique  is  dealt  with  as  a  single  unit; 
eventually  xl>{C)  live  ranges  will  be  spilled.  By  keeping  a  summary  node  in  the  graph,  we 
can  keep  more  live  ranges  in  the  interference  graph  than  there  are  registers,  and  we  delay 
the  decision  on  which  live  range(s)  to  spill  until  more  information  is  available.  The  clique 
representation  affects  the  graph  coloring  in  two  aspects:  computing  the  degree  and  fusing 
graphs. 

4AA  Computing  degree 

The  degree  of  an  interference  graph  node  is  usually  computed  by  summing  the  number 
of  interference  graph  edges  that  are  incident  upon  the  node.  With  clique  summary  nodes, 
however,  additional  care  must  be  taken.  Consider  a  clique  node  C  and  a  regular  node  /r, 
connected  by  an  interference  edge  E,  The  degree  that  edge  E  contributes  to  node  Ir  is 
T{C)  -  4'{0)  because  Ir  interferes  with  every  nodes  inside  clique  C\  The  degree  that  edge 
E  contributes  to  clique  C  is  one.  In  addition  to  the  interference  edges  of  C,  the  degree  of 
a  clique  node  must  add  T{C)  -  i>{C)  -  1  because  within  the  clique,  each  node  interferes 
with  T{C)  -  iy{C)  -  1  other  nodes. 

4,4.2  Fusing  graphs 

As  the  interference  graphs  of  larger  and  larger  regions  are  fused,  the  compiler  acquires  a 
more  global  understanding  of  the  reference  patterns.  This  knowledge  is  especially  useful 
if  the  order  of  fusing  graphs  reflects  either  program  structure  or  execution  frequency. 
Contiguous  spilled  and  nomspilled  live  range  segments  do  not  incur  shuffle  cost  within  the 
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segments  because  no  shuffle  code  is  required.  Therefore,  the  tendency  of  progressively 
growing  live  ranges  during  fusing  determines  which  transparent  live  ranges  need  to  be 
spilled  eventually.  And  the  expansion  of  live  ranges  is  based  on  the  edge  ordering,  which 
makes  the  spilling  decisions  sensitive  to  the  edge  priority  function,  e.g.,  program  structure 
or  execution  frequency. 

Given  an  edge  E  =  (i?i,  Z?2),  the  live  ranges  that  span  across  E  are  merged.  When  a 
live  range  Iri  is  merged  with  Irz,  and  one  of  the  two  is  transparent,  there  are  three  cases  to 
consider: 

1.  If  It\  is  a  spilled  live  range,  lr2  is  in  a  clique  C  (transparent):  lr2  is  removed  from 
C  and  spilled.  Both  T{C)  and  i^{C)  are  decremented  by  one  because  one  spilling 
decision  is  made.  The  decision  grows  a  spilled  live  range,  thereby  allowing  the 
compiler  to  avoid  shuffle  code  on  the  edge  E  (a  contiguous  spilled  live  range  crosses 
the  edge  E), 

2.  If  lr\  is  a  non-transparent  live  range,  lr2  is  in  a  clique  C :  The  fusion  operation  makes 
a  non-spilling  decision  for  lr2  and  removes  /ra  from  C.  Only  T{C)  is  decremented. 
The  decision  enlarges  a  non-spilled  live  range;  since  the  live  range  is  in  a  register  in 
an  adjacent  region,  keeping  lr2  in  the  register  (not  spilled)  eliminates  the  otherwise 
needed  shuffle  code  on  the  edge  E.  Hence  the  compiler  favors  it  over  the  other  live 
ranges  in  the  clique. 

3.  If  lr\  is  in  clique  C\  and  It2  is  in  clique  C2  :  It\  and  /ra  are  coalesced  and  added 
to  a  new  clique,  C  =  C\  n  Ca.  After  processing  all  live  ranges  across  E,  there  are 
three  cliques  C,  C[  and  where  C[  ~  C\  —  C  and  Ci  —  C2  —  C.  V- (^)  establishes 
how  many  ranges  of  C  must  be  spilled  and  is  computed  as  a  function  of  4-{Ci)  and 
t/.'(6a).  There  exists  several  options  to  determine  how  many  live  ranges  to  spill  from 
C\  once  is  known,  rl>{C{)  and  i/’(Ga)  are  computed,  =  V'(^i)  —  4-{C) 

=  iliC2)  - 

(a)  riy{C)  =  mm{T{C),mdx(xi^{Ci),il^{C2))) 

(b)  rliC)  =  min(T(CO,min(vHC^),z/KCa))) 

(c)  v-(C')  =  inin(r(G),max([|^V-(C''i)], 
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(d)  4>(C)  =  min(T(a),min([^^KC'i)l, 

inin(r(C'), )  in  the  equations  is  required  to  make  sure  that  the  condition,  v- (^0  ^ 
T(C),  of  the  clique  C  holds.  The  first  option  tends  to  lower  functions  of  C[  and 
because  takes  the  maximum  of  V- (C-i)  V' (C-i)-  The  third  and  fourth  options 
attempt  to  divide  the  number  of  spilled  live  ranges  among  cliques  proportional  to 
their  sizes. 

If  T{C)  =  'tlf(C)  for  a  clique  C  at  any  time  during  this  phase,  then  all  live  ranges  of 
C  must  be  spilled.  At  that  point,  C  is  an  empty  clique  and  removed  from  the  interference 
graph.  When  a  live  range /r  is  removed  from  clique  (7,  acopy  ofv?(C7)  and  a  new  conflicting 
edge  [/r,(7]  (because  Ir  interferes  with  every  live  range  in  C)  are  generated  for  L.  The 
example  of  Figure  4.16  has  3  live  ranges  in  the  clique  C\  We  want  to  remove  Iva;  from  the 
clique.  A  copy  of  the  interference  edges  (edge  1  and  2)  of  C  is  generated  (dashed  lines). 
The  new  generated  interference  edge  [/ra;,C]  is  added  as  well. 


Clique  C 

Figure  4.16:  Removing  a  live  range  from  a  clique. 

4.4.3  Example  of  delayed  spilling 

In  Figure  4.17(a),  each  basic  block  has  three  live  ranges,  and  they  all  conflict  with  each 
other.  Again,  the  vertical  lines  on  the  right  of  the  blocks  represent  the  live  ranges  of  the 
virtual  registers  cc,  y,  and  (a lightly  shaded  line  indicates  that  the  live  range  is  transparent). 
Three  colors  are  needed  to  color  the  interference  graph  of  each  block.  Since  we  have  only 
two  registers,  we  must  spill  one  live  range  from  each  block.  Assume  that  basic  blocks  are 
regions.  The  graph  simplification  phase  (Section  4.1.2)  needs  to  spill  one  live  range  from 
each  interference  graph  of  basic  blocks  so  as  to  maintain  the  simplifiability  constraint. 
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Figure  4.17:  Example  of  delayed  spilling  (N=2). 


Without  delayed  spilling,  spilling  decisions  need  to  be  made  locally.  The  decisions  may 
not  be  globally  optimal.  If  lr\{y),  lr2{z),  and  lr3{x)  are  spilled  (no  spill  cost),  six  load/store 
operations  (shuffle  code)  are  required  to  move  values  between  registers  and  memory,  one 
store  and  one  load  operations  for  each  live  range  as  shown  in  Figure  4.17(b).  The  delayed 
spilling  technique  is  able  to  delay  spilling  decisions  by  constructing  a  clique  for  each  block 
to  collect  the  two  transparent  live  ranges  of  the  block.  The  V-  function  for  each  clique  is 
one  (one  out  of  two  will  eventually  be  spilled).  Consider  that  the  interference  graphs  of  Z?i 
and  Z?2  are  fused.  The  unioning  rules  described  in  Figure  4.5  decide  that  Irzix)  and  lri{y) 
shouldn’t  be  spilled  because  lri{x)  and  lr2{y)  are  non-transparent  live  ranges.  Therefore, 
we  spill  lri{z)  dndlr2(z)  (depicted  in  Figure  4. 17(c)).  This  spilling  decision  results  in  only 
four  load/store  operations. 


4.5  Biased  coloring 

lri{x)  and  lr2{x)  are  called  adjacent  partners  if  their  values  are  connected  by  a  splitting 
point.  Two  live  ranges  are  remote  partners  if  there  is  no  splitting  point  between  them  but 
they  are  parts  of  the  same  live  range  if  we  construct  the  interference  graph  for  Chaitin-style 
coloring  (no  live-range  splitting).  For  instance.  Figure  4. 18(a)  depicts  an  case  where  the  live 
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range  lr{x)  spans  across  3  blocks  and  is  split  into  3  smaller  live  ranges,  Iri(x),  lr2ix),  and 
lr3(x).  Iri(x)  and  Irzix)  are  adjacent  partners,  and  lr2{x)  and  Irsix)  are  adjacent  partners 
also.  Iri(x)  and/r3(a;)  are  remote  partners. 

A  shuffle  move  is  required  if  lr\(x)  and  lr2(x)  end  up  in  different  registers.  Biased 
coloring  [7]  is  a  common  technique  to  eliminate  a  shuffle  move  between  Iri(x)  and  lr2ix) 
by  assigning  them  the  same  register.  We  define  some  notations  that  are  used  in  describing 
biased  coloring. 


adjaccnt(lr) 

partncr(lr) 

color(lr) 

avoid(lr) 

avail(lr) 


all  adjacent  partners  of  Ir 

all  adjacent  and  remote  partners  of  Ir 

the  assigned  color  of  Ir 

the  set  of  colors  that  Ir  would  like  to  avoid  using 
{c  I  c  ^  U  color(n)] 

nEneigkbor{lr) 


Selecting  a  color  for  a  live  range  Ir  using  biased  coloring  consists  of  the  following  steps: 

1.  choose  a  color  c  €  avail{lr)  A  c  G  IJ  color{n). 

n^adj  acent{lr) 

2.  if  not  found,  choose  a  color  c  e  avail{lr)  Ace  pj  avail(n), 

7i^U7icolored  adjaceni{li') 

3.  if  not  found,  choose  a  color  c  e  avail{lr). 
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The  color  ordering  phase  determines  the  order  in  which  live  ranges  are  assigned  colors. 
The  first  step  finds  an  available  color  matching  a  colored  adjacent  partner.  The  second  step 
tries  to  avoid  using  a  color  that  uncolored  partners  cannot  use.  The  third  step  of  biased 
coloring  picks  a  color  freely  for  Ir  if  none  of  the  previous  two  attempts  succeeds.  Hence, 
the  biased  coloring  mechanism  in  [7]  has  3  defects. 

First,  a  long  live  range  may  be  split  into  several  smaller  segments.  Those  smaller 
segments  may  be  colored  in  an  arbitrary  order.  Without  considering  remote  partners,  biased 
coloring  may  easily  fail  to  find  a  color  for  all  the  segments.  Assume  that  the  color  ordering 
of  the  example  in  Figure  4.18(b)  is  lr]{x),  lr^{x),  and  lr2ix).  Two  live  ranges  circled  by 
a  dotted  line  are  adjacent  partners.  Green  color  is  assigned  to  lri{x).  When  selecting  a 
color  for  ^^{x),  the  register  allocator  chooses  a  color  that  is  still  available  for  lr2{x)  to 
increase  the  likelihood  of  assigning  lr2{x)  and  lr3{x)  the  same  color.  Because  lr2{x)  is  not 
yet  colored,  and  the  register  allocator  doesn’t  take  into  account  remote  partners,  it  is  likely 
to  assign  a  color  to  lr3{x),  which  is  different  from  /ri(a:).  As  a  result,  a  shuffle  code  is 
inevitable  regardless  what  color  is  assigned  to  lr2{x). 

Second,  the  biased  coloring  approach  does  not  consider  the  assigned  colors  of  neighbors’ 
partners  (  [j  |J  color{p))  while  selecting  colors.  In  other  words,  when 

n€  neighbor  (ir)  p£adjacent{n) 

choosing  a  color  for  /r,  the  register  allocator  may  assign  the  color  that  /r’s  uncolored 
neighbors  are  eager  to  get  even  though  there  are  other  available  colors  for  /r.  In  Figure 
4.18(c),  lri{x)  interferes  with  lr{y),  and  the  color  ordering  is  lr2{x),  lr{y),  and  /ri(a;). 
Assume  that  the  red  color  is  assigned  to  lr2{x).  As  biased  coloring  picks  a  color  for  lr{y), 
it  has  no  idea  about  the  assigned  color  of  /ri(a:)’s  partner.  Consequently,  the  red  color  may 
possibly  be  assigned  to  lr{y)  as  well.  Now  it  is  impossible  for  lr\{x)  to  have  the  same 
color  as  lr2{x)  because  the  red  color  is  a  conflicting  color  for  lri{x).  Thus,  shuffle  code  is 
required. 

Third,  biased  coloring  does  not  prioritize  the  shuffle  moves  so  as  to  attempt  to  eliminate 
frequently  executed  shuffle  moves. 

The  biased  coloring  approach  implemented  in  our  register  allocator  deals  with  the  3 
drawbacks  of  the  classical  biased  coloring  approach  adopted  by  prior  reseachers.  Partners  of 
a  live  range  in  our  biased  coloring  approach  consist  of  adjacent  as  well  as  remote  partners. 
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Partners  are  prioritized  in  such  a  way  that  adjacent  partners  have  higher  priorities  than 
remote  partners  and  are  prioritized  based  on  their  estimated  execution  frequencies  because 
frequently  executed  shuffle  moves  should  be  eliminated  first.  The  steps  are  described  as 
follows: 

1.  c  =  color{p),  where  p  =  max.priority{[x  \  x  G  partncr{lr)  A  color{x)  G 
avail{lr)]) 

2.  if  not  found,  choose  a  color  c  G  avail{lr)  Ac  ^  avoid{lr). 

3.  if  not  found,  choose  a  color  c  G  avail{lr). 

4.  if  cis  found,  avoid{n)-\-  =  c,  Vn  G  7icighbor{p),\/p  G  partncr(lr) 

The  first  step  chooses  one  of  the  assigned  colors  of  /r’s  partners  based  on  their  priorities. 
The  chosen  color  must  be  a  non-conflicting  color.  The  second  step  tries  to  find  a  non¬ 
conflicting  color  that  is  not  used  by  the  partners  of  /r’s  neighbors.  The  third  step  chooses  a 
non-conflicting  color  freely.  The  fourth  step  propagates  the  chosen  color  to  the  neighbors 
of  /r’s  partners. 

4.6  Placement  of  shuffle  code 

After  the  color- assignment  phase,  shuffle  code  is  inserted  as  necessary  along  edges.  At  an 
edge  E  {i?i,  Di),  shuffle  code  is  inserted  for  a  virtual  register  x  that  has  been  split  at  E, 
i.e.,  if  lri{x)  and  Iriix)  have  not  been  assigned  the  same  storage  location.  Shuffle  code  can 
be  of  three  types:  shuffle  move,  shuffle  store,  and  shuffle  load. 

One  simple  rule  of  inserting  a  shuffle  store  on  E  is  that  storing  the  value  of  x  back 
to  memory  when  Iri(x)  is  in  a  register,  and  lr2{x)  is  spilled.  The  rule,  however,  is  too 
conservative  because  a:’s  memory  location  may  already  contain  the  up-to-date  value.  When 
fusing  graphs,  the  fusion  operation  propagates  the  has.def  attribute  indicating  if  the  value 
of  a  live  range  is  defined  within  the  live  range.  If  the  has_def  attribute  of  a  live  range  lr{x) 
is  set,  then  shuffle  stores  are  needed  at  the  edges  that  exit  to  a  spilled  live  range  Ir'(x). 
For  example,  in  Figure  4.19(a),  lri{x)  is  assigned  to  a  register,  r  U.,  and  lr2{x)  is  spilled. 
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(a)  (b)  (c) 

Figure  4.19:  Example  of  inserting  shuffle  store. 


Because  x  is  defined  in  lri{x)  (has.def  is  set),  the  value  of  lri{x)  is  stored  back  to  memory 
by  inserting  a  shuffle  store  on  edge  Ei, 

When  the  has.def  attribute  of  a  live  range  is  not  set  (no  definition  in  the  live  range), 
shuffle  stores  may  not  needed.  For  instance,  in  Figure  4.19(b),  lri{x)  is  also  assigned  to 
register  rlx-,  and  lr2(x)  is  spilled.  However,  the  value  of  lri{x)  is  not  defined  in  lri{x) 
(has..def  is  not  set)  and  comes  from  a  spilled  live  range,  lr-i{x).  Because  the  value  of  x  in 
its  memory  location  is  still  up  to  date,  no  shuffle  store  is  needed  on  edge  Ei. 

We  have  to  pay  attention  if  the  value  of  Ir^  (a:)  comes  from  another  live  range  that  is  not 
spilled  and  whose  has_def  is  set  as  shown  in  Figure  4.19(c).  At  the  exit  edge  Ei  of  lri{x), 
the  value  of  x  in  its  memory  location  is  not  up  to  date  so  a  shuffle  store  is  required  to  update 
the  memory  location  of  x, 

A  simple  flow- analysis  technique  is  used  to  compute  if  the  value  of  a  live  range  lr{x) 
in  its  memory  location  is  up  to  date  or  not.  The  flow  analysis  operates  on  shuffle  moves 
unlike  typical  flow  analyses  in  the  global  optimization  phase  that  operate  on  control-flow 
edges.  The  has-def  attribute  flows  through  shuffle  moves  (forward  direction  from  sources 
to  destinations).  Therefore,  in  the  presence  of  the  flow  analysis,  we  know  if  a  shuffle  move 
brings  a  value  that  is  not  yet  stored  back  to  its  memory  location.  If  the  value  of  lr{x)  comes 
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from  a  shuffle  move  and  the  value  in  memory  is  not  up  to  date,  shuffle  stores  are  required 
at  the  edges  that  exit  to  spilled  live  ranges  regardless  of  the  has.def  attribute  of  lr(x). 

4.6.1  Ordering  of  shuffle  code 

Shuffle  code  on  edge  E  =<  Di,D2  >  may  be  placed  in  Z?i,  B2.  or  a  newly  created  block 
inserted  between  Di  and  D2.  If  Di  has  only  one  successor,  which  is  D2,  the  shuffle  code  is 
appended  to  the  end  of  the  schedule  of  Bi.  If  Bi  has  multiple  successors,  and  B2  has  only 
one  predecessor,  the  shuffle  code  is  inserted  into  the  beginning  of  the  schedule  of  Z?2.  In 
these  two  situations,  the  shuffle  code  is  executed  only  when  the  control  flow  goes  through 
E.  Jf  B I  has  more  than  one  successor  and  B2  has  more  than  one  predecessor,  edge  i?  is  a 
critical  edge  [42].  The  shuffle  code  inserted  in  B\  and  B2  may  be  executed  when  the  control 
flow  does  not  go  through  E,  Therefore,  we  perform  edge  splitting  on  E  [42];  a  new  block 
is  created  and  inserted  between  Bi  and  B2,  The  shuffle  code  is  then  put  in  the  new  block. 

There  is  an  ordering  in  which  shuffle  code  is  inserted.  Shuffle  stores  are  inserted  first. 
Shuffle  moves  are  appended  after  shuffle  stores.  Finally  shuffle  loads  are  inserted  after 
shuffle  moves.  A  shuffle  store  for  a  virtual  register  x  at  edge  E  means  that  the  live  range 
lri{x)  ends  at  the  exit  of  Bi  and  lr2{x)  is  spilled  to  memory  in  B2,  In  other  words,  lri{x) 
does  not  interfere  with  any  live  range  that  starts  at  the  entry  of  B2.  Likewise,  a  shuffle  load 
for  a  virtual  register  x  at  edge  E  indicates  that  live  range  lr2{x)  starts  at  the  entry  of  B2 
and  does  not  interfere  with  any  live  range  terminating  at  the  exit  of  B\.  The  two  conditions 
imply  that  shuffle  stores  should  be  ahead  of  shuffle  loads.  A  shuffle  move  for  a  virtual 
register  x  indicates  that  lri{x)  terminates  at  the  end  of  B\  and  lr2{x)  begins  at  the  entry  of 
i?2.  Therefore,  a  shuffle  move  has  both  the  conditions  of  shuffle  store  and  load.  A  shuffle 
store  ilri{y)  to  memory)  should  be  ahead  of  a  shuffle  move  ilri{x)  to  lr2(x)).  Otherwise, 
lri(y)  conflicts  with  lr2{x).  The  value  of  lri{y)  is  mangled  by  the  shuffle  move  if  the  same 
register  is  assigned  to  ln{y)  and  /r2(a:).  Similarly,  a  shuffle  move  should  be  in  front  of  a 
shuffle  load. 

Multiple  shuffle  moves  may  exist  on  edge  E,  Special  attention  must  be  given  to  the 
ordering  among  the  shuffle  moves  to  prevent  the  source  value  of  a  move  being  targeted 
by  another  move.  For  instance,  consider  two  shuffle  moves  for  x  and  y  (ra;3  and 

rj,4  ^  r.y3).  If  the  ordering  for  the  two  moves  is  r^3  f-  r^2  and  then  VyA  ry3,  the 
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value  of  y  in  is  destroyed  by  the  first  move  of  x.  If  we  swap,  however,  the  two  moves 
so  that  ry4  comes  first,  the  two  shuffle  moves  are  able  to  exchange  values  safely. 

The  approach  of  finding  the  ordering  for  shuffle  moves  is  done  by  constructing  a  directed 
graph  whose  nodes  are  shuffle  moves.  An  arc  from  a  move  mvi  to  mv2  represents  that 
the  destination  of  mv\  and  the  source  of  mv2  share  the  same  register.  Traversing  from 
leaves  to  roots  determines  the  ordering  for  shuffle  moves.  If  there  is  a  cycle  in  the  graph, 
saving/restoring  a  register  to/from  a  temporary  memory  location  is  necessary  to  replace 
a  move  in  the  cycle  so  as  to  break  the  cycle.  To  cite  an  instance,  consider  two  moves, 
ra;2  ^  ra:3  and  ^  Vyl.  There  is  a  cycle  between  the  two  moves.  The  code  sequence 
for  the  two  shuffle  moves  is  St  Vyl,  (temp);  ra:2^ra;3;  Id  ry3,  ( temp)  ; .  The 
number  of  overhead  operations  is  counted  as  3  (1  load,  1  store,  and  1  move),  not  2. 

4.6.2  Optimization  of  placement 

The  insertion  of  shuffle  code  is  simple  and  straight  forward,  but  this  step  may  result  in 
partial  redundancies  in  the  code.  For  instance,  the  has-def  property  of  a  live  range  lr{x) 
determines  whether  shuffle  stores  of  x  are  needed  on  the  exit  edges  of  lr{x).  When  the 
definitions  of  x  are  in  less  frequently  executed  blocks,  and  the  exit  edges  (splitting  points) 
are  more  frequently  executed  than  the  definitions,  the  shuffle  stores  are  executed  more  often 
than  necessary.  The  code  hoisting  [41],  code  sinking  [43],  and  pratial  redundancy  [15] 
techniques  can  be  used  to  optimize  the  placement  of  shuffle  loads/stores. 

However,  based  on  the  estimated  costs  of  def-cost  and  the  estimated  costs  of  the  shuf¬ 
fle  move  (register-to-register),  store  (register-to-memory),  and  load  (memory-to-register) 
operations,  a  simple  effective  technique  is  used  to  optimize  the  placement  of  the  shuffle 
stores.  For  each  live  range,  the  costs  of  the  shuffle  move,  store,  and  load  can  easily  be 
obtained  since  each  edge  is  annotated  with  the  estimated  execution  frequency  (either  static 
or  profile).  The  def-cost  of  a  live  range  is  the  estimated  (weighted)  number  of  definitions 
within  the  live  range.  If  the  def-cost  is  less  than  the  cost  of  shuffle  stores,  then  all  shuffle 
stores  are  eliminated  by  inserting  a  new  shuffle  store  right  after  each  definition  and  the 
shuffle  moves.  In  other  words,  for  each  definition  (which  includes  the  shuffle  moves),  there 
is  a  store  writing  the  new  value  back  to  the  memory  location  of  the  live  range  so  as  to  keep 
the  value  in  memory  up-to-date. 
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BO 

B1 


B2 


Figure  4.20:  Placement  of  shuffle  store. 


The  technique  is  quite  effective  in  practice.  The  reason  is  that  the  code  motion  step  of 
the  global  optimization  phase  moves  loop-invariant  common  subexpressions  out  of  loops. 
The  definitions  of  those  common  subexpressions  are  less  frequently  executed  than  the 
references  inside  the  loops,  and  they  may  be  spilled  inside  part  of  the  loops  due  to  the  high 
register  pressure.  In  the  absence  of  the  shuffle-code  placement  optimization,  shuffle  stores 
are  inserted  inside  the  loops.  These  shuffle  stores  are  likely  to  be  executed  more  often  than 
the  definitions. 

Figure  4.20  depicts  the  optimized  placement  of  the  shuffle  code.  The  live  range  x  is 
defined  in  Bq  and  live  through  out  the  whole  program,  x  is  spilled  within  the  loop,  as 
indicated  by  the  shaded  region.  The  shuffle  code  is  highlighted  in  bold  face.  The  straight 
forward  shuffle  code  insertion  requires  two  shuffle  code  operations,  depicted  in  Figure 
4.20(a),  one  shuffle  store  on  Edge  1  (from  Bi  to  the  spilled  region)  and  one  shuffle  load  on 
Edge  2  (from  the  spilled  region  to  B2).  The  shuffle  store  is  executed  on  every  loop  iteration 
in  despite  of  the  fact  that  x  is  never  modified  in  Bi  and  B2.  Since  x  is  defined  in  Bq,  which 
is  less  frequently  executed  than  Edge  1  is  traversed,  the  placement  of  the  shuffle  store  can 
be  optimized  by  eliminating  the  shuffle  store  on  Edge  1  and  inserting  a  new  shuffle  store 
immediately  after  the  definition  of  a;  (depicted  in  Figure  4.20(b)). 
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4.6.3  A  more  detailed  example 

In  this  section,  I  provide  an  example  to  show  the  interaction  between  edge  ordering  and 
shuffle  code  placement.  The  code  fragment  of  Figure  4.21,  is  taken  from  the  function 
update_weights  of  alvinn.  Presumably  there  are  three  registers. 

Consider  an  edge  ordering  based  on  loop  hierarchy: 

{i?3,  Bs),  {i?4,  B2),  (Z?3,  B,),  {B2,  Z?3),  {Bs.  Bs),  {Bu  B2),  {B,,  Bs), 

B'i  requires  all  3  registers  for  tmp74 ,  tmpV  6,  and  tmp77,  which  are  all  referenced 
within  the  block.  In  addition,  there  are  two  transparent  live  ranges  induced  by  tmp73  and 
tmp75.  These  two  live  ranges  are  spilled  when  the  interference  graph  is  simplified 
before  graph  fusion  takes  place.  Z?2  also  needs  all  3  registers,  therefore,  the  only  transparent 
live  range  (tmp74)  for  B2  is  spilled. 

As  we  fuse  G3  along  the  first  edge  {B^,  B^),  the  resulting  graph  doesn’t  require  any 
other  splitting  or  spilling  decisions  since  the  simplifiability  invariant  still  holds.  While 
fusing  G2  and  G4  along  edge  (Z?4,  B2),  the  spilled  live  range  of  tmp7  4  grows  (to  cover  Ba 
and  i?2)-  After  fusing  along  the  remaining  edges,  the  final  resulting  graph  requires  shuffle 
code  to  move  values  among  different  locations.  The  shuffle  code  is  highlighted  in  bold  face 
in  Figure  4.21  (b).  For  tmp7  5,  because  there  is  no  definition  within  the  loop,  the  shuffle 
store  doesn’t  have  to  be  on  edge  {B2,  B3)  and  is  placed  right  after  its  definition,  which  is 
outside  the  loops. 

If  we  change  the  edge  ordering  so  that  (i?4,  B2)  and  {B^,  Ba)  are  switched,  then  the 
register  allocator  ends  up  with  a  different  spilling  decision,  as  seen  in  Figure  4.21  (c).  The 
shuffle  load  of  tmp74  on  {Ba,  B5)  in  Figure  4.21  (b)  is  not  needed  because  Ba  is  no  longer 
in  tinp74’s  spill  region.  Furthermore,  the  shuffle  load  of  tmp75  is  placed  on  {Ba,  B2) 
instead  of  (Z?3,  Ba),  because  the  spill  region  for  tmp7  5,  indicated  by  a  dotted  line,  contains 
Ba>  (So  instead  of  being  in  the  loop,  it  is  now  on  the  loop  back  edge.) 

4.7  Evaluation 

I  measured  the  impact  of  the  register  allocation  strategy  for  various  SPEC92  programs 
(alvinn,  compress,  ear,  eqntott,  espresso,  gcc,  li,  sc,  doduc,  fpppp, 


tmp73  = .... 
tmp74  = .... 
tmp75  = .... 

for  ( ;  tmp73  <  limiti  ; )  { 

..  =  ..  tmp73  .. 

..  =  ..  tmp75  .. 
tmp76  = ..  tmp75 
for  ( :  tmp76  <  Iimit2  ; )  { 
tmp77  =  *(tmp76): 
*(tmp76)=tmp74*tmp77: 
tmp76  =  tmp76  +  4; 

} 

tmp73  =  tmp73  +  400; 

} 

for  ( :  tmp74  <  linnit3  ; )  { 

..  =  ..  tmp74  .. 

} 


tmp73= ... 
tmp74=: ... 
tmp75  = ... 


loop  1 


loop  2 


...  tmp73 ... 

...  tmp75 ... 
tmp76= ..  tmp75 


B1 


B2 


tmp77  =  *{tmp76) 
*(tnnp76)  =  tmp74  *  tmp77 
tmp76  =  tmp76  +  4 


B3 


tmp73  =  tmp73  +  400 


loop  3 


...  tmp74 . 
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Figure  4.21:  Code  fragment  from  alvinn  with  N  =  3. 
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matr  ix3  0  0,  nasa7,  spice,  and  tomcatv).  I  contrast  the  fusion-based  approach  with 
the  results  using  an  improved  Chaitin-style  approach  (with  the  enhancements  of  Chapter  6). 
I  use  all  registers  on  the  MIPS,  adhering  to  the  standard  calling  convention.  We  start  with 
the  smallest  region  size,  one  basic  block,  and  grow  these  until  we  have  the  interference 
graph  for  the  function.  All  functions  within  the  main  loop  of  alvinn  are  inlined. 

For  both  approaches,  I  run  two  experiments.  First,  I  use  only  static  information  to 
guide  register  allocation.  That  is,  we  use  loop  depth  to  estimate  the  execution  frequency 
of  a  block.  In  the  fusion-based  approach,  when  considering  the  basic  blocks  inside  a  loop, 
I  use  breadth-first  order  to  select  edges  for  fusing.  In  a  second  experiment,  I  use  profile 
information  from  a  prior  execution. 

As  expected,  for  programs  with  many  small  functions,  Chaitin’s  algorithm  works  fairly 
well,  and  the  fusion-based  approach  produces  identical  results.  For  most  of  these  programs, 
use  of  profile  information  improves  the  result,  independent  of  the  register  allocation  strategy. 
Let  us,  therefore,  turn  our  attention  to  programs  with  significant  register  pressure. 

alvinn,  nasa7,  tomcatv,  doduc,  and  f  pppp  are  the  programs  with  high  register 
pressure,  and  here  we  see  the  limitations  of  the  all-or-nothing  approach  to  spilling  that  is  the 
foundation  of  Chaitin-style  register  allocation.  Figure  4.22  shows  that  register  allocation 
based  on  graph  fusion  is  able  to  split  live  ranges  properly  and  thereby  cuts  for  alvinn 
nearly  52  %  of  the  overhead  required  by  Chaitin-style  allocation  using  profile  information. 
Chaitin-style  allocation  finds  the  “best”  complete  live  ranges  to  put  into  registers,  therefore, 
the  result  is  not  improved  by  using  profile  information.  Our  approach  breaks  live  ranges,  as 
can  be  seen  by  the  lower  number  of  spill  loads.  And  profile  information  provides  a  better 
cost  function  for  the  color-assignment  phase  (when  deciding  if  a  caller-save  or  a  callee-save 
register  should  be  assigned  to  a  live  range,  or  the  live  range  should  be  spilled),  resulting 
in  a  further  improvement.  Looking  at  the  results  for  f  pppp  using  static  information,  we 
see  the  same  story.  The  overhead  of  f  pppp  is  reduced  by  28  %;  the  large  contribution  of 
shuffle  code  indicates  that  live  ranges  have  been  split.  If  we  use  profile  information,  then 
both  approaches  find  the  “right”  live  ranges  and  give  the  same  result.  For  tomcatv,  our 
approach  removes  40  %  of  the  overhead  operations.  Because  the  estimated  static  execution 
frequency  provides  a  good  estimation,  the  profile  information  does  not  reduce  overhead 
operations  relative  to  the  static  case. 


Data  movement  operations  Data  movement  operations  Data  movement  operations 
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Figure  4.22:  All  available  registers  (26  int,  16  double). 
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There  are  3  big  functions  in  doduc:  subb,  supp,  and  ddef  lu.  These  three  account 
for  the  majority  of  data  movement  operations,  subb  and  supp  consist  of  only  one  big 
block.  For  those  types  of  functions,  our  register  allocator  produces  the  same  amount  of  data 
movement  operations  as  a  Chaitin-style  allocator,  since  the  sole  basic  block  is  the  complete 
compilation  unit.  One  simple  way  to  deal  with  such  blocks  is  to  partition  the  big  block 
into  smaller  blocks  so  that  our  approach  can  be  applied  to  make  better  spilling  and  splitting 
decisions.  We  show  in  Figure  4.22  the  result  for  ddef  lu,  where  large  portions  of  the  spill 
loads  and  stores  are  replaced  by  shuffle  code. 


4.8  Summary 

In  this  chapter  I  present  a  new  approach  to  register  allocation  which  is  fusion-based. 
The  approach  deals  with  spilling,  splitting,  and  color  assignment  in  an  integrated  fash¬ 
ion.  Fusion-based  register  allocation  produces  good  results  in  those  situations  where  a 
conventional  Chaitin-style  allocator  breaks  down.  Of  course,  there  are  programs  without 
significant  register  pressure.  However,  as  aggressive  compiler  transformations  become 
more  common,  even  those  simple  programs  are  turned  into  challenges  for  a  global  register 
allocator.  In  the  next  chapter,  I  investigate  how  the  fusion-based  approach  behaves  under 
the  register  pressure  caused  by  aggressive  code  transformations. 
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Chapter  5 
Reconciliation 


Nowadays,  modern  processors  have  multiple  function  units  as  well  as  the  ability  to  dy¬ 
namically  detect  data  dependences  and  to  issue  multiple  independent  instructions  per  cycle 
[57,  21,  4,  32,  31].  To  extract  as  much  instruction-level  parallelism  (ILP)  as  possible, 
modern  compilers  are  looking  at  increasingly  larger  regions  of  a  program.  Many  global 
optimizations  have  also  been  extended  to  concentrate  on  frequently  executed  regions  at 
the  potential  expense  of  less  optimal  code  in  infrequently  executed  regions.  For  exam¬ 
ple,  superblock  optimizations  [14]  extend  code  motion,  dead  code  elimination,  and  other 
optimizations  to  work  on  regions  comprising  extended  basic  blocks.  Regions  have  also 
been  used  to  guide  predication  and  if-conversion  [22].  This  general  compilation  strategy 
—  wherein  the  compiler  prioritizes  frequently  executed  program  regions  by  concentrat¬ 
ing  the  scope  of  an  optimization  phase  on  such  regions  —  has  been  termed  region-based 
compilation  [33].  To  exploit  the  full  power  of  region-based  techniques,  scope  enhancing 
transformations,  such  as  loop  unrolling  and  function  inlining,  are  necessary  to  enable  the 
formation  of  large  regions  that  comprise  the  most  frequently  executed  blocks  of  a  pro¬ 
gram.  For  example,  selective  unrolling  of  innermost  loops  is  an  attractive  scope-expanding 
optimization  for  region-based  compilers  [47], 


5.1  Challenges 

Region-based  compilation  is  at  odds  with  register  allocation:  standard  register  allocation 
algorithms  operate  at  the  scope  of  an  entire  function  and  do  not  focus  their  efforts  on  high 
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priority  regions.  Moreover,  scope  enhancing  transformations,  when  coupled  with  aggressive 
region-based  optimizations  (e.g.,  global  instruction  scheduling  and  speculative  code  motion) 
severely  increase  register  pressure,  creating  serious  problems  for  standard  function-wide 
graph-coloring  register  allocators.  For  example,  loop  unrolling  may  increase  the  register 
pressure  considerably.  After  common  subexpression  elimination,  strength  reduction,  and 
code  scheduling,  the  live  ranges  of  the  different  instances  of  the  (former)  loop  body  now 
overlap,  and  the  register  allocator  faces  a  more  difficult  situation  than  without  unrolling. 
If  the  register  allocator  is  unable  to  handle  the  unrolled  loop,  then  the  benefits  of  using 
aggressive  region-based  optimizations  can  be  negated  by  the  register  allocation  strategy. 

Furthermore,  how  to  reconcile  register- allocation  decisions  among  regions  is  another 
challenge  to  register  allocation.  The  IMPACT  compiler  generates  code  on  a  region  basis. 
Reconciliation  binding  decisions  among  regions  is  one  big  problem  for  the  compiler.  To 
cite  from  Hank’s  thesis  [34]  that  implements  register  allocation  in  the  IMPACT  compiler: 

If  the  register  binding  information  is  not  taken  into  account  during  the 
allocation  of  the  current  region,  separate  allocation  can  quickly  produce  an 
explosion  of  copy  operations  at  region  boundaries. 

Consider  that  lr{x)  and  lr\x)  are  in  two  adjacent  regions,  R  and  R!  respectively,  and  the 
two  live  ranges  span  across  an  edge  E  that  connects  the  two  regions.  Shuffle  code  is  required 
if  register  allocation  assigns  lr{x)  and  lr\x)  different  registers.  Assume  that  is  a  higher 
priority  region  than  R'  and  is  compiled  already.  The  IMPACT  region-basedregister  allocator 
uses  an  ad  hoc  approach  to  reconcile  binding  decisions  between  two  regions.  To  avoid  an 
excessive  number  of  copy  operations  (shuffle  moves),  the  IMPACT  compiler  attempts  to 
assign  lr\x)  the  same  register  as  lr{x).  If  the  compiler  fails  to  assign  the  register  of  lr{x) 
to  tr'{x),  then  lr'{x)  is  spilled.  Copy  operations  are  completely  eliminated  at  the  expense 
of  shuffle  loads/stores.  In  other  words,  expensive  (usually  more  than  one  cycle)  load/store 
operations  are  used  to  trade  off  cheap  (one  cycle)  copy  operations.  Because  the  IMPACT 
region-based  register  allocator  is  incapable  to  reconcile  binding  decisions  among  regions, 
we  see  that  this  region-based  register  allocator  produces  more  overhead  operations  than  a 
function-based  register  allocation  for  8  of  1 1  benchmark  programs.  Moreover,  when  we  see 
the  measurement  of  the  execution  time,  we  notice  that  5  of  the  1 1  benchmark  programs  using 
region-basedregister  allocation  run  slower  than  using  function-based  register  allocation. 
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In  this  chapter,  I  investigate  the  effectiveness  of  fusion-based  register  allocation.  There 
are  three  topics  to  gauge  the  impact  of  register  allocation  by  graph  fusion  on  region-based 
compilation: 

•  What  is  the  impact  of  the  choice  of  the  base  region  (Section  5.2); 

•  How  sensitive  is  register  allocation  to  profile  information,  which  determines  the 
various  heuristics  employed  in  a  register  allocator  (Section  5.3); 

•  How  sensitive  are  these  approaches  to  register  pressure  (Section  5.4). 

•  How  sensitive  are  these  approaches  to  different  region  sizes  (Section  5.5). 

•  How  sensitive  are  these  approaches  to  two  code  transformations,  loop  unrolling 
(Section  5.6)  and  function  inlining  (Section  5.7). 

5.2  Choice  of  base  region 

One  of  the  parameters  of  fusion- style  register  allocation  is  the  shape  of  the  base  regions 
(for  which  we  compute  the  initial  interference  graphs).  If  we  choose  a  complete  function 
as  the  base  region,  then  fusion-style  register  allocation  is  identical  to  the  (function-wide) 
Chaitin-style  register  allocation.  The  size  of  the  base  region  affects  spill  and  shuffle  costs. 
In  this  section,  we  investigate  how  the  size  of  the  base  region  has  an  effect  on  the  quality 
of  register  allocation.  Section  5.2.1  shows  an  example  that  favors  a  small  size  of  the  base 
region.  Section  5.2.2  depicts  an  example  that  favors  a  big  size  of  the  base  region. 

5.2.1  Favor  small  size  of  the  base  region 

The  example  of  Figure  5.1  (a)  is  taken  from  Figure  4.2(a).  This  case  prefers  a  small  size 
of  the  base  region.  We  list  the  execution  frequency  beside  each  edge.  The  most  frequently 
executed  path  is  Z?i  i?2  ^4-  The  edge  ordering  that  the  register  allocator  considers 

for  graph  fusion  is  <7?i,Z?2>,  <Z?i,2?3>,  and  <B^,Ba>.  Let  us  first  consider  the 

case  that  the  size  of  the  base  region  is  one  single  block.  Each  interference  graph  for  each 
block  is  shown  in  Figure  5.1(b).  After  graphs  are  fused  along  the  first  three  edges,  <Z?i,Z?2>» 


88 


CHAPTERS.  RECONCILIATION 


<D2,B4>,  and  <D^  ,i?3>,  the  resulting  graph  can  still  be  simplified  with  N=2  (shown  in  Figure 
5.1(c)).  As  the  graph  is  fused  along  <i?3.f?4>  by  combining  Ir^iy)  and  ln(y),  the  final 
graph  (shown  in  Figure  5.1(d))  becomes  too  constrained;  the  maintaining-simpliflability 
step  then  splits  the  live  range  of  y  on  edge  <Z?3,f?4>  (we  obtain  the  interference  graph  in 
Figure  5.1(c)  by  splitting  the  live  range  of  y).  As  a  result,  there  are  10  incurred  shuffle 
move  operations  using  one  single  block  as  the  size  of  the  base  region. 

Consider  that  the  size  of  the  base  region  is  increased  up  to  3  blocks.  Namely,  the  size  of 
the  base  region  cannot  exceed  3  blocks.  Region  lii  contains  the  most  frequently  executed 
path  B\  D%  D^,  and  region  Rz  contains  only  one  single  block,  B^  (shown  in  Figure 
5.1  (e)).  The  interference  graph  of  each  region  is  depicted  in  Figure  5.1(f).  The  fusion-based 
register  allocator  performs  graph  fusion  along  two  edges,  <f?i  ,i?3>  and  <Z?3,i74>.  After  Gm 
and  C?ii2  are  fused  along  edge  <BuBi>,  we  get  the  resulting  graph  that  is  the  same  as  the 
interference  graph  in  Figure  5.1(c).  Like  in  the  previous  example  using  blocks  as  regions, 
fusing  this  graph  along  edge  <Bi,B^>  generates  the  same  graph  as  the  interference  graph  in 
Figure  5.1(d).  Again  splitting  the  live  range  of  y  takes  place  to  maintain  the  simpliflability 
invariant.  There  are  also  10  incurred  shuffle  move  operations. 

As  soon  as  the  size  of  the  base  region  goes  up  to  4  blocks,  all  four  blocks  are  in  one 
region  R.  All  four  control-flow  edges  are  in  R,  no  gr^h  fusion  is  needed.  Gr  is  the  same 
as  the  interference  graph  in  Figure  5.1(d).  Simplification  of  the  interference  graph  blocks 
during  the  graph  simplification  phase  (Section  4.1.2).  Spilling  one  live  range  is  needed. 
The  live  range  of  x  has  the  smallest  spill  cost  (190  spill  code  operations).  The  spill  cost  of 
the  other  two  live  ranges,  x  and  y,  are  200.  Spilling  the  live  range  of  a: ,  therefore,  incurs  190 
spill  code  operations.  The  big  size  of  the  base  region  introduces  more  overhead  operations 
than  the  small  size  of  the  base  region. 

5.2.2  Favor  big  size  of  the  base  region 

Figure  5.2  (a)  is  the  code  depicted  in  Figure  4.14.  The  interference  graph  of  each  block 
is  shown  in  Figure  5.2(b).  The  lightly  shaded  live  ranges  are  transparent  live  ranges.  The 
register  pressure  of  B\  and  B^  is  3.  Regardless  the  size  of  the  base  region,  the  register 
allocator  has  to  make  out-of-color  spilling  decisions  for  the  base  regions  that  consist  of 
Bi  or  Bt.  If  the  size  of  the  base  region  is  small,  the  quality  of  register  allocation  may  be 
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(a)  (c) 

Figure  5.2:  Size  of  base  regions  (N=2). 


hurt  by  making  spilling  decisions  locally.  If  we  use  blocks  as  the  base  regions,  the  graph 
simplification  step  spills  Iri(x)  and  lr2(2^)  because  the  register  allocator  favors  spilling 
transparent  live  ranges.  Although  the  register  allocator  does  not  produce  spill  code  using 
the  spilling  decision  (spill  transparent  live  ranges),  it  pays  370  shuffle  load/store  operations 
(100  shuffle  stores  of  x  on  the  edge  entering  Du  90  shuffle  loads  of  a:  on  <Z?i,i?2>,  90 
shuffle  stores  of  on  and  90  shuffle  loads  of  on  <Bi,Di>). 

However,  if  we  put  all  three  blocks  into  one  region,  the  register  allocation  has  a  more 
global  view  so  as  to  make  wise  spilling  decisions.  The  interference  graph  of  the  region 
is  shown  in  Figure  5.2(c).  The  graph  simplification  step  then  spills  lr{x)  which  has  the 
smallest  spill  cost  (90)  and  is  composed  of  lr]{x)  and  /r2(a:).  The  spilling  decision  generates 
90  spill  code  operations  and  100  shuffle  store  operations  on  the  edge  that  enters  Z?i.  The 
big  size  of  the  base  region  introduces  fewer  overhead  operations  than  the  small  size  of  the 
base  region. 

5.2.3  Evaluation 

To  test  the  hypothesis  that  fusion-style  register  allocation  handles  scope-enlargement  grace¬ 
fully,  I  increase  the  scope  of  the  base  region  by  using  loop  bodies  as  base  regions.  In  Figure 
5.3,  I  report  the  performance  using  basic  blocks  and  loop  bodies  as  the  choice  of  base 
regions,  both  with  and  without  loop  unrolling.  Without  loop  unrolling,  the  performance 
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of  the  basic  block  regions  is  identical  to  the  performance  of  the  loop  body  regions.  When 
loops  are  unrolled,  however,  using  single  basic  blocks  as  the  base  regions  produces  fewer 
overhead  operations  —  using  the  larger  regions  formed  by  loop  bodies  after  loop  unrolling 
incurs  a  slight  penalty.  Nevertheless,  even  the  worst  case  for  fusion-style  register  allocation 
(i.e.,  the  case  where  the  base  regions  are  the  unrolled  loop  bodies)  produces  only  50%  of 
the  overhead  operations  generated  by  a  Chaitin-style  register  allocator.  In  the  best  case, 
using  fusion-style  register  allocation  results  in  only  12%  of  the  transfer  operations  required 
by  Chaitin-style  register  allocation. 


Figure  5.3:  Transfer  operations  in  alvinn,  using  all  registers,  for  two  different  region 
definitions. 

Figure  5.4  depicts  the  result  of  compiling  for  a  smaller  number  of  registers  (1 1  integer, 
9  floating  point  registers).  Notice  the  big  difference  in  the  number  of  overhead  operations 
(with  unrolling)  between  the  base  Chaitin-style  register  allocator  and  the  fusion-style  register 
allocator  (almost  a  factor  of  4).  The  choice  of  regions,  however,  has  no  noticeable  impact 
on  fusion-style  register  allocation;  there  just  are  not  enough  registers  to  keep  all  values  in 
registers.  We  notice  a  large  number  of  shuffle  operations  for  loop  unrolling  code,  indicating 
that  the  register  allocator  tries  hard  to  keep  values  in  registers,  even  if  this  means  that  a 
live  range  must  be  broken  into  smaller  segments.  From  Figures  5.3  and  5.4  we  see  how 
poorly  the  function-wide  approach  (Chaitin-style)  performs  compared  to  a  region-based 
framework: 

•  without  unrolling,  the  fusion-style  approach  executes  50%  of  the  operations  executed 
by  the  Chaitin-style  approach; 

•  with  unrolling,  using  the  fusion-style  approach  results  in  25%  of  the  operations 
required  by  the  Chaitin-style  allocator. 
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Furthermore,  when  the  scope  is  expanded  via  loop  unrolling,  the  fusion-style  approach,  in 
the  small-number-of-registers  scenario,  executes  only  60%  of  the  data  movement  operations 
executed  by  the  Chaitin-style  approach  when  given  all  available  registers. 
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Figure  5.4:  Transfer  operations  in  alvinn,  small  number  of  registers,  for  two  different 
region  definitions. 


5.3  Profiling 

The  figures  presented  so  far  in  this  section  were  obtained  using  static  information  for  edge 
ordering  and  heuristics.  Information  obtained  from  program  execution  can  sometimes 
improve  a  register  allocator’s  heuristics,  thereby  leading  to  a  better  register  allocation. 
Figure  5.5  contrasts  the  performance  of  a  Chaitin-style  register  allocator  for  alvinn  with 
two  variants  of  the  fusion-style  register  allocator  when  using  profile  information. 

For  all  approaches,  the  results  for  the  unrolled  versions  of  the  program  are  improved 
noticeably  by  using  profile  information.  Compare  the  results  for  Chaitin-style  allocation 
of  Figure  5.3  and  5.5  (the  two  left-most  columns).  We  see  that  in  the  unrolling  case, 
the  number  of  overhead  operations  is  cut  in  half  for  Chaitin-style  register  allocation  using 
profile  information  as  compared  to  Chaitin-style  register  allocation  using  static  information. 
Without  unrolling,  the  improvement  is  minor.  This  is  not  surprising.  With  unrolling,  there 
are  more  values  that  can  be  put  into  registers.  But  without  profile  information,  the  heuristics 
to  choose  spilling  candidates  misguides  the  register  allocator.  When  the  compiler  uses 
profile  information,  better  heuristic  choices  are  possible,  and  we  see  a  clear  improvement 
since  there  exists  many  candidates  to  pick  from  in  the  unrolled  code. 

Fusion-style  allocator  is  able  to  exploit  the  same  effect  (better  heuristics).  Since  fusion- 
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style  register  allocation  is  edge-order  sensitive,  we  can  expect  additional  effects  from  the 
improved  edge  ordering  that  is  obtained  from  the  use  of  profile  information.  Comparing 
Figure  5.3  with  Figure  5.5,  we  see  that  the  use  of  profile  information  reduces  the  number  of 
overhead  operations  by  50%  (for  the  unrolled  loops).  The  number  of  load  operations  due 
to  spilling  is  significantly  reduced,  and  we  see  also  a  reduction  in  the  number  of  operations 
to  save  or  restore  caller-save  registers.  This  reduction  of  spill  overhead  indicates  that  the 
profile  information  improved  the  heuristics  that  allow  the  register  allocator  to  make  better 
spilling  decisions.  The  change  in  caller-save  register  induced  overhead  implies  that  the 
profile  information  allowed  the  register  allocator  to  control  the  growth  of  live  ranges.  If 
there  are  sufficient  registers,  without  any  checks  and  balances,  fusion  grows  live  ranges 
until  a  live  range  conflicts  with  other  live  ranges.  As  the  live  range  grows,  it  is  more 
likely  to  include  function  calls.  These,  however,  incur  a  cost  if  the  value  is  in  a  caller-save 
register.  If  there  are  only  a  few  references  to  the  value,  then  it  might  be  better  to  leave  the 
live  range  in  memory.  Access  to  accurate  p»rofile  information  allows  the  register  allocator 
to  better  analyze  the  tradeoff  among  caller-save  registers,  callee-save  registers,  and  spilling 
(See  Chapter  6).  This  improved  decision  making  accounts  for  the  better  results  depicted 
in  Figure  5.5.  We  also  see  in  Figure  5.5  that  profile  information  results  only  in  minor 
improvements  if  we  use  blocks  as  base  regions. 
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Figure  5.5:  Transfer  operations  in  alvinn,  using  all  registers  and  profile  information. 


For  another  view,  I  present  in  Figure  5.6  the  results  from  compiling  espresso,  using 
the  high-register-pressure  scenario.  We  notice  that  the  Chaitin-style  register  allocator 
requires  significant  spill  loads,  whereas  the  fusion-style  allocator  causes  significant  shuffle 
activity  (some  of  these  operations  are  stores).  Furthermore,  looking  at  the  influence  of  the 
choice  of  the  base  region,  we  notice  that  basic  blocks  as  base  regions  provide  only  slightly 


94 


CHAPTERS.  RECONCILIATION 


better,  but  vastly  different  results.  The  number  of  spill  loads  is  reduced  dramatically,  which 
indicates  that  the  fine-grain  base  allows  the  register  allocator  to  keep  values  in  a  register  that 
end  up  in  memory  using  the  slightly  coarser-grain  base  of  a  loop  nest.  In  other  words,  the 
register  allocator  is  able  to  split  live  ranges,  but  moving  values  from  one  storage  location  to 
another  implies  shuffle  code.  About  half  of  the  shuffle  code  are  load  operations,  the  other 
half  stores.  Register  moves  play  almost  no  role  for  this  program.  The  discussion  of  the 
differences  between  basic  blocks  and  loop  nests  as  a  base  region  extends  to  using  complete 
functions  as  abase  region,  as  is  done  in  a  standard  Chaitin-style  allocator.  We  see  again  that 
a  finer-grain  choice  of  base  region  (loops  vs.  functions)  reduces  the  number  of  overhead 
operations. 


Figure  5.6;  Transfer  operations  in  espresso,  no  unrolling. 

Figure  5.7  depicts  the  result  when  compiling  the  programs  with  loop  unrolling.  Clearly, 
expanding  the  scope  implies  a  penalty  for  register  allocation.  The  fusion-style  approach, 
however,  requires  only  half  of  the  overhead  operations  that  are  inserted  by  a  Chaitin-style 
allocator.  Again,  profile  information  helps  although  not  as  much  as  in  the  absence  of 
unrolling. 
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Figure  5.7:  Transfer  operations  in  espresso,  with  unrolling. 
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5.4  Sensitivity  to  number  of  registers 

The  transformations  of  region-based  compilers  increase  register  pressure,  and  another  way 
to  assess  how  a  register  allocator  deals  with  register  pressure  is  to  investigate  how  sensitive 
the  allocator  is  to  the  number  of  registers  that  are  available.  Clearly,  as  we  increase  the 
number  of  registers,  there  comes  the  point  of  diminishing  return.  The  next  figures  depict  the 
number  of  overhead  operations  as  we  allow  the  register  allocator  to  use  more  registers  (the 
number  of  additional  registers  is  shown  on  the  X  axis).  Recall  that  the  MIPS  architecture  has 
separate  register  banks  for  integer  and  floating-point  values.  The  notation  (Ri,Rf,Bi,B/) 
of  the  next  figures  shows  the  combination  of  caller-save  and  callee-save  registers  for  the 
two  register  banks.  and  R/  are  the  number  of  caller-save  registers  for  integer  and 
floating-point  values,  respectively.  Ei  and  i?/  are  the  number  of  callee-save  registers  for 
integer  and  floating-point  values.  The  standard  calling  convention  of  the  MIPS  machine 
uses  4  registers  of  the  integer  bank  to  pass  arguments  and  2  registers  to  pass  function-return 
values;  in  addition,  2  floating-point  registers  pass  arguments  and  2  floating-point  registers 
can  hold  a  function-return  value.  All  these  registers  are  caller-save  registers.  Hence,  the 
register  allocator  uses  at  least  (6,4, 0,0). 

Looking  at  the  SPEC92  programs,  we  can  classify  them  roughly  into  5  classes: 

1.  Fusion-style  >  Chaitin-style:  Fusion-style  register  allocation  produces  consistently 
better  results  than  Chaitin-style  allocation.  Examples  are  alvinn,  espresso, 
nasaV,  and  tomcatv  (Figure  5.8). 

2.  Fusion-style  «  Chaitin-style  (profile  sensitive):  Programs  sensitive  to  getting  accurate 
frequency  information  (i.e.,  given  the  same  frequency  information,  both  fusion-style 
and  Chaitin-style  register  allocation  perform  about  the  same),  compress,  sc,  li, 
and  gcc  are  four  examples  (Figure  5.9). 

3.  Fusion-style  >  Chaitin-style  (static  case):  The  heuristics  that  determine  which  live 
range  to  spill  are  sometimes  helped  by  profile  information,  and  in  this  case,  a  Chaitin- 
style  allocator  may  obtain  the  same  results  as  fusion-style  allocation;  see  Figure  5.10 
for  an  example  (f  pppp).  The  results  for  the  Chaitin-style  allocator  are  marginally 
better  than  the  result  for  fusion-style  allocation.  Even  with  profile  information,  the 


5.5.  DIFFERENT  REGION  SIZES 


97 


register  allocator  must  make  many  guesses,  and  results  like  those  depicted  in  Figure 
5.10  indicate  that  the  spilling  decision  still  depends  on  some  heuristics. 

4.  Fusion-style  >  Chaitin-style  (small  number  of  registers):  Fusion-style  register  al¬ 
location  produces  fewer  overhead  operations  than  Chaitin-style  allocation  when  the 
register  allocator  uses  a  small  number  of  registers.  As  the  number  of  register  in¬ 
creases,  the  difference  of  register  overhead  diminishes,  ear  and  matrixSOO  are 
two  examples  (Figure  5.11). 

5.  Fusion-style  w  Chaitin-style:  Finally,  there  are  programs  where  neither  the  choice 
of  register  allocation  strategy  nor  the  use  of  profile  information  seems  to  make  a 
difference.  Examples  of  this  class  of  programs  are  eqntott,  doduc,  and  spice 
shown  in  Figure  5.12. 


5.5  Different  region  sizes 

In  this  section,  we  evaluate  the  influence  of  different  region  sizes  to  fusion-based  coloring. 
Three  kinds  of  region  are  investigated,  blocks,  loops,  and  traces.  The  region-formation 
phase  determines  a  trace  by  selecting  a  seed  block  and  growing  the  trace  along  desirable 
paths.  A  trace  T  is  expanded  in  the  forward  as  well  as  the  backward  direction: 


•  forward:  T  is  expanded  forward  by  adding  block  Dk  to 'T  if  the  condition  is  satisfied. 


3E  =<  Dj,  Dk  >  ABj  €  T  A 


freqjE) 

frcq{Df) 


>3  A 


frcqjPk) 

frcq{sccd) 


•  backward:  T  is  expanded  backward  by  adding  block  Bi  to  T  if  the  condition  is 
satisfied. 


3E  =<  Bj  >  ABj  €  7^  A 


freqjE) 

freqjBi) 


>  3  A 


freqjBi) 

frcqjseed) 


frcqjBj)  and  freqjE)  are  the  estimated  execution  frequencies  of  block  Bj  and  edge 
E  respectively.  The  two  constants,  3  and  7,  are  set  to  0.5.  These  two  conditions  are  the 
conditions  used  in  the  IMPACT  compiler  to  form  regions  [34]. 


Operations  (Million)  Operations  (Million) 
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Chaitin-style 

-i>-  Fusion-style 

Chaitin-style 

Fusion-style 

(Profile) 

(Profile) 

nasaV  alvinn 


tomcatv  espresso 


Figure  5.8:  Fusion-style  >  Chaitin-style, 
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matrixSOO 


nasaV 


Figure  5.14:  Different  region  size  (cont.) 
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tomcatv 


Figure  5.15:  Different  region  size  (cont.). 


Figures  5.13, 5.14,  and5.15  depict  the  experimental  results  offusion-style  coloring  with 
three  different  region  sizes.  There  are  several  noticeable  aspects.  None  of  the  region  sizes 
always  outperforms  the  others  because  the  results  highly  depend  on  program  characteristics 
and  combination  of  registers.  The  influence  diminishes  as  the  number  of  registers  increases 
because  the  number  of  registers  is  then  sufficient  for  all  different  region  sizes.  In  general, 
alvinn,  espresso,  gcc,  and  doduc  prefer  a  small  size  of  the  base  region.  For  gcc 
and  doduc,  the  results  of  using  traces  as  regions  are  as  good  as  using  blocks,  f  pppp, 
nasaV,  and  tomcatv  prefer  a  big  size  of  the  base  region  (loops  or  traces). 


5.6  Loop  unrolling 

Loop  unrolling  is  the  code  transformation  that  exposes  multiple  loop  iterations  to  the 
compiler.  Classical  global  optimizations  such  as  constant  folding,  common  subexpression 
elimination,  and  strength  reduction  can  be  applied  across  different  loop  iterations.  Loop 
branches  and  exits  inside  unrolled  loop  bodies  may  be  eliminated.  Code  scheduling  can  take 
advantage  of  the  bigger  scope  of  loop  bodies  and  extract  ILP  among  multiple  loop  iterations . 
However,  the  beneficial  opportunities  to  global  optimizations  and  code  scheduling  do 
not  come  for  free  because  of  increased  register  pressure.  More  common  subexpression 
elimination  opportunities  imply  that  more  live  ranges  are  needed  for  the  temporaries  to 
hold  the  increased  number  of  common  subexpressions.  Strength  reduction  replaces  a  more 
expensive  (more  cycles)  expression  by  a  cheaper  (fewer  cycles)  expression.  A  temporary 
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that  is  live  throughout  the  loop  is  required  to  hold  the  value  of  the  expression.  Hence,  the 
more  expensive  expressions  are  replaced  by  cheaper  expressions,  the  more  temporaries  (live 
ranges)  are  introduced.  All  in  all,  the  larger  scope  of  unrolled  loop  bodies  implies  that  more 
live  ranges  are  required  because  there  exist  more  values,  variables,  and  expressions.  More 
aggressive  code  scheduling  implies  that  the  number  of  live  ranges  that  are  simultaneously 
live  increases  because  more  otherwise  disjoint  live  ranges  now  overlap. 

I  run  experiments  with  different  combinations  of  register  for  alvinn,  espresso, 
nasa7,  and  tomcatv  with  unrolling  to  see  how  graciously  Chaitin-style  and  fusion-style 
coloring  behave  in  the  presence  of  unrolling.  To  allow  an  easy  comparison,  the  results  of 
Figure  5.8  are  repeated  in  Figures  5.16  and  5.17,  There  are  two  interesting  aspects.  First, 
unrolling  does  not  impose  serious  register  pressure  for  some  SPEC92  programs.  For  nas  a7 
and  tomcatv  in  both  the  static  and  dynamic  cases,  Chaitin-style  and  fusion-based  register 
allocation  with  unrolling  produce  a  similar  number  of  overhead  operations  as  without 
unrolling.  Second,  unrolling  generates  register  pressure  for  alvinn  and  espresso. 
We  see  that  the  curves  of  Chaitin-style  and  fusion-style  register  allocation  with  unrolling 
diverge  further  apart  than  without  unrolling.  For  alvinn,  fusion-style  coloring  with 
unrolling  generates  twice  as  many  overhead  operations  as  without  unrolling  (25  %  less  than 
Chaitin-style  with  unrolling)  when  the  number  of  registers  is  small.  Nonetheless,  as  the 
number  of  registers  increases,  fusion-style  coloring  is  able  to  deal  with  the  high  pressure 
caused  by  unrolling  and  produces  roughly  the  same  number  of  overhead  operations  as 
without  unrolling.  When  analyzing  the  increase  of  overhead  operations  for  Chaitin-style 
with  unrolling  relative  to  without  unrolling,  we  see  that  Chaitin-style  coloring  is  not  able 
to  handle  the  increased  register  pressure  benignly  for  alvinn.  The  number  of  overhead 
operations  for  unrolling  using  profile  information  grows  only  50  %  at  (7,5 ,0,0)  but  is  tripled 
for  the  other  combinations  of  registers.  The  number  of  overhead  operations  in  the  static 
case  increases  less  than  80  %  when  the  number  of  registers  is  small  (prior  to  (8,6,2, 2))  but 
is  quadrupled  when  the  number  of  registers  exceeds  (8,6,2,2).  For  espresso,  the  gap 
between  Chaitin-style  and  fusion-style  enlarges  and  the  merge  points  of  the  curves  shift  to 
the  right.  Here  we  see  that  fusion-style  handles  high  register  pressure  graciously. 
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Figure  5.17:  Overhead  operations  of  register  allocation  for  loop  unrolling  (cont.). 

5.7  Function  inlining 

Function  inlining  is  a  code  transformation  that  expands  the  compilation  scope  beyond 
function  boundaries.  Functions  are  artificial  boundaries  induced  by  programmers.  Those 
boundaries  may  not  be  the  best  partitioning  borders  as  far  as  compilers  are  concerned. 
Function  inlining  affects  the  compiler  in  several  aspects: 

•  global  optimization:  Compilers  can  improve  the  quality  of  a  program  once  they 
know  that  certain  values  are  constants  in  the  program.  Constant  folding  can  calculate 
the  value  of  an  expression  at  compile  time  instead  of  run  time  [2].  Branch  folding 
eliminates  a  conditional  branch  if  the  result  of  the  comparison  of  the  branch  is 
known  at  compile  time.  Loop  unrolling  prefers  constant  loop  bounds.  When  a 
constant  is  passing  to  a  function  call  as  an  argument,  without  inter-procedural  constant 
propagation  [11, 30],  the  compiler  must  be  conservative  and  assume  that  the  argument 
is  not  a  constant.  Consequently,  some  optimization  opportunities  are  lost. 

Function  inlining  also  exposes  more  opportunities  to  the  global  optimizer  to  find 
common  subexpression  between  the  caller  and  callee.  Moreover,  functions  are  no 
longer  barriers  for  code  motion  [36],  e.g.  a  loop  invariant  of  the  callee  can  be  moved 
outside  a  loop  of  the  caller. 

•  memory  aliasing:  Without  inter-procedural  analysis,  it  is  impossible  for  compilers  to 
get  the  information  regarding  incoming  arguments.  In  such  a  case,  compilers  usually 
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make  conservative  assumptions  on  memory  aliasing;  that  is,  two  memory  accesses 
always  alias  if  both  use  incoming  arguments  as  base  addresses.  Conservative  memory 
aliasing  constrains  code  scheduling  to  move  instructions —  two  memory  accesses 
must  be  issued  in  order.  Function  inlining  replaces  formal  parameters  by  their 
actual  parameters.  The  memory  aliasing  information  is  improved  (more  precise)  as  a 
result  [3]. 

•  ILP:  Without  function  inlining,  the  compiler  is  limited  by  the  traditional  scope  of  the 
compilation  unit  (function).  The  bigger  the  compilation  scope,  the  more  likely  code 
scheduling  is  able  to  exploit  ILP  (more  likely  to  find  independent  instructions).  [33] 
shows  that  some  C  programs  spend  80  %  of  their  execution  time  in  functions  with 
fewer  than  250  operations  prior  to  inlining.  After  inlining,  those  C  programs  spend 
50  %  of  their  execution  time  in  functions  with  more  than  1000  operations. 

•  calling  convention:  Passing  parameters  to  the  callee  and  returning  a  value  to  the  caller 
are  usually  done  via  argument  registers  and  stack.  For  example,  the  caller  moves 
outgoing  arguments,  if  there  are  any,  to  argument  registers  or  pushes  the  arguments 
onto  the  local  stack  if  all  argument  registers  are  used  up  by  prior  arguments.  The 
callee  then  follows  the  same  convention  to  get  actual  parameters  and  moves  them 
to  their  corresponding  formal  parameters.  Function  inlining  eliminates  the  overhead 
for  moving  arguments,  returning  function  values,  adjusting  the  stack  frame,  and 
saving/restoring  the  return  address. 

•  register  allocation:  Two  counteractive  factors  are  happening  to  register  allocation 
when  functions  are  inlined;  increment  of  spill  cost  and  decrement  of  call  cost.  When 
a  function  call,  bar  { ) ,  is  replaced  by  the  function  body  of  bar,  the  number  of 
live  ranges  in  the  current  function  increases  and  all  live  ranges  that  contain  the  call 
originally  interfere  with  all  new  live  ranges  introduced  by  bar.  In  other  words, 
function  inlining  increases  register  pressure  and  hereby  potentially  introduces  more 
spill  code  operations.  Nevertheless,  if  the  compiler  divides  the  register  file  into  callee- 
save  and  caller-save  registers,  the  call  cost  may  be  reduced  because  the  function  call 
boundaries  no  longer  exist.  As  long  as  the  reduction  of  the  call  cost  exceeds  the 
increase  of  the  spill  cost,  register  allocation  benefits  by  inlining. 
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Figure  5.18  sketches  an  example  in  which  register  aliocation  benefits  by  inlining.  In 
Figure  5.18  (a),  function  f  oo  contains  a  loop  body  that  is  executed  many  times  and 
has  a  call  to  a  leaf  function  bar.  The  shaded  regions  are  the  most  frequently  executed 
paths  of  the  two  functions.  Hence,  there  is  high  cost  to  save/restore  a  callee-save 
register  at  the  entry/exit  of  bar;  there  is  high  cost  to  save/restore  a  caller-save  register 
around  the  call  site  bar  { ) .  That  is,  assigning  a  callee-save  register  to  a  live  range 
in  bar  needs  to  pay  high  callee-save  cost;  assigning  a  caller-save  register  to  a  live 
range  containing  the  call  bar  ( )  needs  to  pay  high  caller-save  cost.  After  function 
bar  is  inlined,  the  necessity  of  saving/restoring  registers  at  bar’s  function  boundary 
disappears  (shown  in  Figure  5.18(b)).  foo  is  now  a  leaf  function  (no  caller-save 
cost)  and  the  only  function  boundary  for  the  code.  Therefore,  assigning  callee-save 
registers  to  the  live  ranges  that  were  in  bar  pay  the  callee-save  cost  at  f  oo’s  function 
boundary  which  is  less  frequently  executed  than  bar’s;  assigning  caller-save  registers 
to  the  live  ranges  that  contained  the  call  bar  ( )  pays  no  caller-save  cost. 

foo(){ 

prologue 


foo  (  )  { 
prologue 


s 

epilogue 


epilogue 

} 

(b) 


Figure  5.18:  Function  inlining. 

•  compilation:  As  functions  are  inlined,  the  code  size  grows.  Consequently,  the 
complexity  of  the  compilation  time  and  memory  requirement  increase  as  well  [33]. 
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The  compiler  has  to  control  the  complexity  so  that  the  compilation  time  and  memory 
requirement  are  acceptable  and  practical. 

In  this  Section,  I  evaluate  the  influence  of  inlining  on  register  allocation.  Only  data 
of  alvinn,  ear,  espresso,  compress,  and  sc  are  presented  because  the  cmcc 
compiler,  presently,  does  not  have  the  infrastructure  for  doing  function  inlining.  It  would 
be  very  tedious  and  inefficient  to  analyze  and  inline  functions  manually,  espresso  has 
one  compilation  flag  "-DNO  JNLINE".  By  default,  some  frequently  executed  functions  are 
inlined.  To  increase  register  pressure,  some  functions  of  alvinn,  ear,  compress,  and 
sc  are  inlined  manually.  All  functions  (6  functions)  of  alvinn  are  inlined  into  the  main 
function.  All  performance  data  of  alvinn,  ear,  and  espresso  in  this  thesis  are  the 
inlined  versions  unless  it  is  explicitly  stated  that  the  non’inlined  versions  are  used. 

Register  overhead  with  and  without  inlining  are  given  in  Figure  5.19  and  Figure  5.20. 
For  easy  comparison,  the  register  overhead  figures  of  the  inlined  versions  of  alvinn, 
ear,  and  espresso,  and  the  non-inlined  versions  of  compress  and  sc  are  taken  from 
Section  5.4.  There  are  several  observable  aspects.  The  curves  of  Chai tin- style  and  fusion- 
style  without  inlining  either  merge  quickly  (espresso)  or  are  tightenly  together  (alvinn, 
ear,  compress,  and  sc).  Second,  ear  and  espresso  with  Mining  produce  fewer 
overhead  operations  than  without  inlining  for  both  Chaitin- style  and  fusion-style  coloring 
because  the  reduction  of  the  call  cost  exceeds  the  increase  of  the  spill  cost.  The  overhead 
operations  of  alvinn,  compress,  and  sc  with  Mining  for  both  Chaitin-style  and  fusion- 
style  coloring  increase  because  the  increment  of  the  spill  cost  is  bigger  than  the  decrement 
of  the  call  cost.  Third,  fusion-style  not  only  benefits  from  the  reduction  of  call  cost,  but 
also  handles  big  compilation  scopes  well  and  splits  live  ranges  at  low  frequently  executed 
paths  to  reduce  spill  cost  (live-range  splitting  is  very  effective  in  the  presence  of  Mining). 
For  these  reasons,  the  gap  between  the  curves  of  Chaitin-style  and  fusion-style  register 
allocation  enlarges.  Examples  are  alvinn,  espresso,  and  compress. 


5.8  Execution  time  and  compilation  time 

There  are  two  other  dimensions  of  interest  to  compiler  designers:  execution  time  related  to 
register  overhead  and  compilation  time.  Table  5.1  shows  the  speedup  due  to  fusion-style 
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alvinn  (no  inlining)  alvinn  (inlining) 


ear  (no  inlining)  ear  (inlining) 


espresso  (no  inlining)  espresso  (inlining) 

Figure  5.19:  Function  inlining  vs.  Without  inlining. 
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program 

All  registers 

14  int,  14  float 

espresso 

b.o 

7.5 

nasaV 

2.4 

5.6 

tomcatv 

8.4 

7.0 

Table  5.1:  Speed-up  of  execution  time  over  Chaitin-style  coloring  (percentage). 

register  allocation  (compared  to  a  Chaitin-style  register  allocator  with  various  enhance¬ 
ments).  These  data  are  for  the  MIPS  architecture;  the  execution  time  has  been  measured 
on  a  DECstation  5000.  The  table  shows  that  fusion-style  coloring  can  speed-up  execution 
time  by  up  to  8.4  %  over  Chaitin-style  coloring. 

Table  5.2  shows  the  cost  of  fusion-style  register  allocation;  this  table  lists  the  contribution 
of  register  allocation  to  overall  compilation  time  (seconds)  for  the  cmcc  compiler,  citicc  is  a 
research  compiler  and  not  fine-tuned  for  compilation  speed;  overall  cmcc  takes  about  twice 
as  long  as  the  native  compiler.  The  data  are  collected  on  a  DEC  AlphaStation  (333MHz). 
We  compile  the  SPEC92  programs  and  measure  the  compilation  time  for  Chaitin-style  and 
fusion-style  register  allocation,  using  loops  as  well  as  single  basic  blocks  as  base  regions. 
We  also  stress  register  allocation  by  performing  the  unrolling  transformation  (unroll  3  times) 
to  enlarge  the  code  size  for  some  of  the  programs. 

The  complexity  of  simplification  is  0(1^  -f  f)  where  1^  is  the  number  of  live  ranges  and  £ 
is  the  number  of  interference  edges.  After  fusing  graphs  along  an  edge,  fusion-based  register 
allocation  performs  simplification  to  maintain  the  simplifiability  invariant.  Hence,  at  first 
sight,  we  expect  to  see  that  the  fusion-based  approach  has  a  big  increase  of  compilation 
time  because  the  complexity  of  maintaining  the  simplifiability  constraint  is  Cl(  *  (V'  -I-  £)) 
where  T  is  the  number  of  fusion  operations.  From  the  table,  however,  we  see  that — except 
for  gcc — ^register  allocation  time  increases  oniy  by  a  small  constant  factor  (<  2),  even 
when  the  base  regions  are  basic  blocks  and  the  compiler  applies  a  fusion  operation  on  every 
control-flow  edge.  For  gcc  (c-parse.tab.c),  the  size  of  the  control-flow  graph  of  the  yylex 
routine  is  tripled  by  loop  unrolling  (  unroll  3  times).  The  size  grows  from  (415  blocks, 
660  edges)  to  (1426  blocks,  2246  edges).  Even  in  such  a  big  control-flow  graph,  the  cost 
of  the  fusion  operations  increases  only  about  by  a  factor  of  3  if  the  base  regions  are  loops. 
To  put  this  data  into  perspective,  the  complete  code  generator  including  register  allocation 
is  responsible  for  less  than  20%  of  the  total  compilation  time.  So  although  fusion-style 
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program 

Chaitin 

Region 

LOOP 

BB 

alvinn  inline 

1.3 

1.6 

1.6 

alvinn  inline  unroll 

4.8 

5.2 

5.7 

compress 

1.7 

2.0 

4.4 

compress  unroll 

3.9 

5.2 

11.9 

ear 

4.2 

5.5 

8.2 

ear  unroll 

15.7 

26.6 

49.5 

eqntott 

4.7 

6.1 

10.7 

eqntott  unroll 

18.3 

29.3 

74.4 

espresso 

29.3 

39.6 

71.6 

espresso  unroll 

104.0 

175.6 

417.2 

li 

5.0 

5.9 

9.4 

li  unroll 

11.7 

16.5 

32.8 

sc 

15.1 

25.5 

62.8 

sc  unroll 

33.2 

71.2 

160.1 

gcc 

148.7 

211.7 

669.7 

gcc  (c-parse.tab.c) 

7.0 

14.7 

61.2 

gcc  (c-parse.tab.c)  unroll 

22.1 

55.0 

166.7 

doduc 

51.4 

78.1 

128.8 

fpppp 

60.6 

70.8 

83.6 

matrixSOO 

1.1 

1.7 

2.5 

matrixSOO  unroll 

4.1 

7.5 

11.0 

nasa7 

7.4 

10.9 

13.8 

nasa7  unroll 

55.2 

80.1 

100.9 

spice 

118.2 

311.0 

392.4 

tomcatv 

3.4 

3.7 

3.7 

tomcatv  unroll 

12.7 

17.2 

30.7 

Table  5,2:  Compilation  time  of  register  allocation  (seconds). 


allocation  incurs  a  price  in  compilation  time,  the  increase  is  acceptable. 

There  are  two  major  factors  that  keep  the  compilation  time  spent  on  register  allocation 
low:  partial  graphs  and  cliques.  When  graphs  are  fused  along  edge  E  that  connects  two 
regions,  Ri  and  the  fusion  operation  needs  to  deal  with  only  Gui  and  G^2  not  all  the 
graphs  in  the  program.  Clique  summary  nodes  also  help  to  speed  up  the  compilation  time 
spent  on  simplification  because  simplifying  a  clique  removes  all  transparent  live  ranges  in 
the  clique  from  the  graph.  Consider  that  a  clique  G  contains  T  transparent  live  ranges. 
Each  transparent  live  range  in  C  interferes  with  T  -  1  transparent  live  ranges  in  G  and 
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ncighbor(C).  Therefore  C  represents  T  live  ranges  and  +  T*  |  ncighbor(C)  |, 

v^here  |  ncighbor(C)  |  is  the  number  of  neighbors  of  C\  Removing  C  and  ',p(C)  from  the 
interf erence  graph  is  equivalent  to  remo  ving  T  live  ranges  and  ^  |  ncighbor(C)  | 

interference  edges.  These  two  reasons  keep  the  price  of  compilation  time  within  a  reasonable 
and  acceptable  range. 

Fusion-based  coloring  provides  a  nice  framework  to  tackle  the  compilation  time  problem 
if  the  compilation  time  ever  becomes  unacceptable  or  is  a  big  concern  to  the  compiler.  The 
fusion  operation  takes  longer  and  longer  as  more  and  more  graphs  have  been  fused  because 
the  fusion  operation  gradually  deals  with  bigger  graphs,  which  cover  larger  regions,  and 
makes  all  delayed  spilling  decisions  (live  ranges  are  removed  from  cliques).  Graph  fusion 
is  based  on  the  edge  ordering,  so  fusing  graphs  on  the  edges  that  are  less  important  (less 
frequently  executed)  is  more  expensive  in  term  of  compilation  time,  yet  has  less  influence 
on  the  overall  register  allocation  overhead.  To  curtail  the  time  spent  on  graph  fusion,  we 
can  set  a  threshold  to  control  the  number  of  control-flow  edges  along  which  graphs  are 
fused.  The  threshold  can  be  based  on  the  time  that  has  already  been  spent  on  register 
allocation  or  the  number  of  the  total  control-flow  edges  that  connect  regions.  As  soon  as 
register  allocation  passes  the  threshold,  the  graphs  of  the  regions  that  are  not  yet  fused  with 
any  other  graphs  are  all  lumped  into  one  big  region  Rremainmg  (splitting  live  ranges  is  not 
allowed  within  the  region).  Building  the  graph  is  as  fast  as  building  the  graph 

for  Chaitin-style  coloring.  We  then  apply  the  fusion  operation  to  the  edges  that  connect 
Rremaining  and  Other  regions. 

The  example  in  Figure  5.21  is  a  code  fragment  of  compress.  The  shaded  region 
is  the  most  frequent  path.  We  can  apply  graph  fusion  to  the  region.  The  fusion-style 
coloring  is  able  to  further  divide  the  region  into  smaller  sub-regions  so  that  finer  granularity 
live-range  splitting  is  thereby  possible  within  the  region.  The  remaining  blocks  belong  to 
one  region,  Rremaining  ^  The  register  allocator  treats  Rremaining  as  a  typical  region.  The 
interference  graph,  is  constructed,  the  graph  is  simplified  so  as  to  make  sure 

that  the  simplifiability  invariant  is  satisfied.  The  register  allocator  can  either  apply  the 
fusion  operation  on  edges,  Ei,  E^,  E4,  Es,  and  Ee,  to  reconcile  the  two  regions,  or  does 
nothing  and  let  the  color-assignment  phase  eliminate  shuffle  code  using  biased  coloring. 

Tables  5.3  and  5.4  show  experimental  results  of  the  compilation  time  versus  register 
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Fusion% 

Time 

Fusion  ^  jQQ 

^Chaitin 

Overhead 

Chaitin 
^  Fusion 

(14i,14f) 

(lli,8f) 

(14i,14f) 

(lli,8f)  1 

100 

128.0 

120.6 

62.6 

59.5 

90 

122.0 

114.9  n 

58.4 

16.6 

80 

116.3 

112.0 

44.2 

14.2 

70 

106.6 

100.9 

31.9 

4.7 

60 

93.1 

87.4 

27.3 

7.8 

50 

78.9  ^ 

72.2 

6.3 

-4.1 

40 

66.0 

60.7 

7.4 

-5.8 

30 

51.2 

47.9 

6.9 

3.8 

20 

36.4 

33.8 

12.0 

7.7 

10 

21.4 

20.3 

2.8 

7.8 

Table  5.3:  Percentage  increase  of  compilation  time  and  improvement  of  overhead  operations 
(espresso). 


Fusion% 

Time 

1)*100 

^Chaitin 

Overhead 

Chaitin 
^  Fusion 

(14i,14f) 

(lli,8f) 

(14i,14f) 

(lli,8f) 

100 

128.0 

120.6 

62.6 

59.5 

90 

126.8 

118.0 

62.6 

56.6 

80  1 

117.2 

110.6 

59.2 

42.5 

70 

107.5 

100.9 

45.1 

36.0 

60 

97.0 

91.7 

31.5 

25.4 

50 

85.2  1 

77.7 

26.9 

24.3 

40 

72.0 

65.6 

24.8 

18.4 

30 

57.2 

51.3 

16.6 

14.9 

20 

41.9 

38.7 

16.6 

12.7 

10 

25.6 

23.8 

11.6 

13.0 

Table  5 .4:  Percentage  increase  of  compilation  time  and  improvement  of  overhead  operations 
(espresso). 
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overhead  operations  for  espresso.  For  the  experiment,  fusion- style  coloring  uses  blocks 
as  the  base  regions  and  profile  information  to  guide  register  allocation.  Data  for  two 
register  combinations,  14  integer  and  14  floating-point,  and  8  integer  and  11  floating-point 
registers,  are  collected.  The  left-most  column  is  the  percentage  of  total  control-flow  edges 
along  which  fusion-style  coloring  applies  fusion  operations.  The  ’Time"  columns  indicate 
the  percentage  increase  of  compilation  time  of  fusion-style  coloring  relative  to  Chatin-style 
coloring.  The  "Overhead"  columns  are  the  percentage  reduction  of  overhead  operation  over 
Chaitin-style  coloring  when  we  use  fusion-style  coloring. 

Table  5.3  is  the  result  of  applying  fusion  operations  on  the  first  k  percentage  of  total 
control-flow  edges.  During  the  region  formation  phase  in  Figure  4.1,  the  register  allocator 
collects  blocks  that  are  connected  by  the  last  100  ~  k  percentage  of  control-flow  edges  in 
the  edge  ordering  into  one  region  R.  That  is,  those  edges  are  in  the  region  R.  When  looking 
at  the  overhead  columns,  we  can  observe  one  noticeable  aspect  that  for  a  cutoff  point  of 
50%  in  the  (14i,14f)  case,  the  improvement  due  to  fusion  drops  to  a  few  percentage  over 
Chaitin-style  coloring.  There  is  a  big  overhead  improvement  drop  for  the  (1  li,8f)  case  even 
though  we  apply  fusion  operations  on  90%  of  edges.  We  also  see  cases  where  fusion-style 
coloring  produces  more  overhead  operations  than  Chaitin-style  coloring.  The  reason  is  that 
the  region  R  contains  frequently  executed  blocks.  Consider  the  example  of  Figure  5.21.  If 
Ee  is  R,  then  R  contains  The  register  pressure  of  infrequently  parts  of  R,  therefore, 
may  cause  shuffle  code  on  frequently  executed  edges,  <Z?2»R3>  and  <R3,Ri>.  Furthermore, 
collecting  infrequently  executed  edges  into  R  deprives  the  opportunities  of  placing  shuffle 
code  on  those  edges.  Here  we  see  the  case  that  a  bad  choice  of  regions  impacts  the  quality 
of  register  allocation  badly.  Hence,  we  take  another  approach  that  collects  the  blocks 
connected  by  the  first  k  percentage  of  edges  into  one  region.  In  other  words,  the  region  is 
frequently  executed.  The  result  is  shown  in  Table  5.4.  This  table  indicates  that  the  approach 
is  able  to  maintain  good  overhead  improvement  while  reducing  compilation  time. 


5.9  Summary 

In  this  chapter,  I  have  discussed  four  aspects  of  register  allocation:  the  size  of  base  region, 
profile  information,  sensitivity  to  register  pressure,  and  sensitivity  to  unrolling  and  inlining 
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code  transformations.  The  results  confirm  that  fusion-style  register  allocation  keeps  the 
overhead  operations  under  control.  Even  in  the  presence  of  scope  expanding  transforma¬ 
tions,  loop  unrolling  and  function  inlining,  the  number  of  overhead  operations  generated 
by  the  fusion-based  approach  either  grows  slowly  or  is  reduced  by  splitting  live  ranges  at 
infrequently  executed  edges.  These  results  indicate  that  register  allocation  is  no  longer  an 
obstacle  to  region-based  compilation.  Compilers  that  use  some  form  of  region  for  other 
optimization  phases  now  have  a  path  to  reconcile  register  allocation  with  region-based 
optimization. 

Furthermore,  our  results  indicate  that  profile  information  improves  the  performance 
of  the  register  allocator.  As  gathering  profile  information  becomes  more  common  in 
optimizing  compilers,  fusion-style  register  allocation  becomes  increasingly  attractive,  since 
fusion-style  allocation  is  well-prepared  to  directly  take  advantage  of  profile  information. 
As  the  implementations  of  high-performance  processors  include  more  and  more  hardware 
support  for  dynamic  instruction  reordering,  optimizers  and  schedulers  in  the  compiler  have 
to  analyze  larger  and  larger  regions  of  a  program.  Region-based  compilation  provides  an 
attractive  framework  to  organize  these  compilations.  With  fusion-style  register  allocation, 
the  compiler  can  now  include  register  allocation  in  a  region-based  framework. 


Chapter  6 

Call- Cost  Directed  Register  Allocation 

The  register  allocation  cost  is  the  sum  of  three  components:  spill,  call,  and  shuffle  cost. 
The  register  allocation  cost  must  include  the  call  cost.  If  the  register  allocator  focuses  on 
the  spill  cost  alone,  the  register  allocator  may  have  an  overly  optimistic  image  of  the  task. 
The  call  cost,  however,  is  influenced  by  the  compiler’s  calling  convention.  Many  compilers 
divide  the  registers  into  two  sets,  callee-save  and  caller-save  registers,  respectively. 

Choosing  the  right  kind  of  register  for  live  ranges  plays  a  major  role  in  eliminating  the 
register-allocation  overhead  when  the  compiled  function  is  frequently  executed  or  function 
calls  are  on  the  most  frequently  executed  paths.  Picking  the  wrong  kind  of  register  for  a 
live  range  incurs  a  high  penalty  that  may  dominate  the  total  overhead  of  register  allocation. 

In  this  chapter,  I  present  three  improvements,  storage-class  analysis,  benefit-driven 
simplification,  and  preference  decision  that  are  effective  in  selecting  the  right  kind  of 
register  for  a  live  range.  Also  I  describe  one  enhancement  to  fusion-style  coloring  that 
takes  call  and  shuffle  cost  into  account  to  make  splitting  decisions.  In  Section  6.1,  I 
present  an  allocator  based  on  Chaitin-style  coloring  with  a  simple  cost  model.  This  register 
allocator  is  used  as  the  base  for  performance  comparison.  In  Section  6.2, 1  describe  the 
first  enhancement,  storage-class  analysis,  that  decides  the  storage  class  for  live  ranges.  In 
Section  6.3, 1  present  benefit- driven  simplification  that  determines  the  color  ordering  in  a 
novel  manner.  In  Section  6.4,  I  present  the  third  enhancement,  preference  decision,  that 
determines  the  preferable  kind  of  register  for  live  ranges.  In  Section  6.5,  I  describe  the 
fourth  enhancement,  call-cost  splitting,  which  splits  live  ranges  to  reduce  call  cost.  In 
Section  6.6, 1  evaluate  the  effectiveness  of  the  first  three  enhancements.  In  Section  6.7, 1 
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evaluate  the  influence  of  the  call-cost  splitting  enhancement. 

Figure  6.1  sketches  the  framework  of  the  register  allocator.  The  shaded  regions  are 
the  places  where  the  four  improvements  are  implemented.  Storage- class  analysis  and 
preference  decision  are  implemented  in  the  color-assignment  phase;  benefit-driven  sim¬ 
plification  is  done  in  the  color-ordering  phase;  call-cost  splitting  is  implemented  in  the 
graph-construction  phase.  The  improvements  by  no  means  can  only  be  used  by  our  register 
allocator.  Indeed,  because  various  register  allocation  approaches  have  been  implemented 
in  this  same  framework,  these  approaches  may  all  benefit  from  the  improvements.  For 
instance,  fusion-style,  Chaitin-style,  and  optimistic  coloring  all  profit  by  storage-class 
analysis,  benefit-driven  simplification,  and  preference  decision.  Priority-based  coloring, 
however,  cannot  use  benefit-driven  simplification  because  it  does  not  use  simplification  to 
determine  the  coloring  order. 


Figure  6.1:  Call-cost  directed  register  allocation. 


6.1  Chaitin-style  base  model 

Chaitin-style  coloring  simplifies  an  interference  graph  to  determine  the  coloring  order  and 
spills  live  ranges  based  on  spill  cost  (either  static  estimated  or  dynamic  spill  cost)  when 
simplification  blocks.  A  live  range  that  contains  calls  prefers  using  a  callee-save  register  so 
the  color-assignment  phase  attempts  to  find  a  callee-save  register.  If  there  is  no  callee-save 
register  left  (taken  by  all  the  live  range’s  neighbors),  a  legal  caller-save  register  is  assigned 
to  the  live  range.  A  live  range  containing  no  calls  prefers  using  a  caller-save  register  so 
the  color- assignment  phase  attempts  to  find  a  caller-save  register.  If  there  is  no  caller-save 
register  left,  a  legal  callee-save  register  is  assigned  to  the  live  range. 

This  simple  model  is  used  as  the  base  for  comparison.  It  allows  us  to  access  the  signifi¬ 
cance  of  each  improvement  in  terms  of  reducing  overall  overhead  operations.  Although  the 
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base  model  is  simple,  it  is  not  unreasonable.  In  Chapter  7,  we  compare  this  base  model  to  a 
more  sophisticated  approach  to  handle  call  cost  which  is  implemented  in  various  compilers. 
However,  for  a  significant  number  experiments,  the  simple  base  model  outperforms  the 
other  approach. 

6.1.1  Limitations  of  the  base  model 

Figure  6.2  depicts  the  register  allocation  cost  for  two  programs  from  the  SPEC92  suite, 
eqntott  and  ear,  for  various  combinations  of  caller-save  and  callee-save  registers. 
Again,  the  MIPS  architecture  has  separate  register  banks  for  integer  and  floating-point 
values.  The  notation  (^Ri,rtf,Ei,Ef)  of  the  x-axis  of  Figure  6.2  shows  the  combination  of 
caller-save  and  callee-save  registers  for  the  two  register  banks.  Ri  and  R/  are  the  number 
of  caller-save  registers  for  integer  and  floating-point  values,  respectively.  Ei  and  Ef  are  the 
number  of  callee-save  registers  for  integer  and  floating-point  values. 


ear  eqntott 

Figure  6.2:  Register  allocation  cost. 


From  Figure  6.2,  we  see  that  giving  the  register  allocator  more  registers  can  reduce  the 
spill  cost  dramatically.  The  spill  cost  is  cut  down  to  0.8  %  with  (10,8,4,4)  registers  for 
eqntott  and  to  0.5  %  with  (9,7,3,3)  registers  for  ear.  However,  simply  focusing  on  the 
spill  cost  is  misleading;  the  figure  shows  that  spill  cost  is  no  longer  an  issue  once  a  certain 
number  of  registers  is  available.  As  the  figure  illustrates,  we  must  take  the  call  cost  (caller- 
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save  and  callee-save  cost)  into  consideration.  There  are  two  noteworthy  aspects:  first,  the 
contribution  of  the  call  cost  to  total  register  allocation  cost  is  significant.  Second,  giving  the 
register  allocator  more  registers  may  actually  worsen  the  register  allocation  cost  because 
some  live  ranges  may  now  reside  in  the  registers  whose  call  overheads  introduce  more 
memory  accesses  than  the  spill  cost  of  the  live  ranges.  Cohn  and  Lowney  [18]  measured  the 
dynamic  call  cost  for  various  Windows/NT  applications  and  SPEC95  programs.  For  more 
than  half  of  the  programs,  the  call  cost  is  10%  -  25%  of  their  execution  time.  If  the  register 
allocation  does  not  take  the  call  cost  into  consideration,  the  call  overhead  introduced  by  the 
register  allocation  may  end  up  on  the  hot  part  of  the  program  [18]. 


6.2  Storage-class  analysis 

A  live  range  can  reside  in  one  of  these  storage  classes  (assuming  that  registers  are  divided 
into  caller-save  and  callee-save  registers): 

•  memory, 

•  a  caller-save  register,  or 

•  a  callee-save  register, 

Each  storage  class  has  an  associated  cost.  Storage-class  analysis  decides  where  live  ranges 
reside,  based  on  two  functions,  benefit jMllcr  and  benefit.eallee.  The  two  functions 
model  the  benefits  provided  by  these  two  storage  classes  over  the  default  location  (memory) . 
These  functions  are  defined  for  each  live  range  Ir.  For  each  Ir,  benefit.ealler{lr)  is 
defined  as  the  weighted  reference  counts  of  the  spill  code  minus  the  weighted  caller-save 
cost.  bcnefit.callee{lr)  is  defined  as  the  weighted  reference  counts  of  the  spill  code 
minus  the  weighted  callee-save  cost.  The  weighted  reference  counts  and  call  cost  are 
derived  using  either  estimated  execution  frequency  (static)  or  profiling  information  of  some 
prior  execution  (dynamic).  That  is,  these  two  functions  indicate  the  estimated  number  of 
load/store  operations  that  are  eliminated  if  a  caller-save  (or  callee-save)  register  is  assigned 
to  Ir.  The  two  functions  are  similar  to  the  priority  function  of  [16],  except  the  benefits  are 
not  normalized  by  the  size  of  the  live  ranges. 
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During  the  color- assignment  phase,  the  selection  of  the  kind  of  register  to  use  is  based 
on  these  two  functions.  If  bcnefit^callec{lr)  >  bcncfit^callcr{lr),  finding  an  available 
callee-save  register  for  Ir  is  attempted  prior  to  finding  an  available  caller-save  register.  If 
benefit  ^callcc{lr)  <  6cnc /ii-ca//cr(/r),  it  is  prefer  able  to  put /r  into  a  ca/fer-.yave  register 
over  using  a  callee-save  register.  The  accuracy  of  the  two  benefit  functions  depends  on  the 
accuracy  of  the  estimated  execution  frequency. 

The  simplification  phase  guarantees  that  the  color- assignment  phase  finds  registers  for 
live  ranges  that  are  on  the  color  stack  (C).  However,  sometimes  not  using  a  register  (i.e., 
spilling  a  live  range)  is  better  than  using  the  wrong  kind  of  register  because  the  cost  of 
using  the  register  (callee-save  or  caller-save  cost)  could  be  greater  than  the  spill  cost  of 
the  live  range.  Figure  6.3  illustrates  the  effect  of  assigning  the  wrong  kind  of  register  to 
live  range  x.  The  shaded  regions  indicate  the  most  frequently  executed  paths.  Using  a 
caller-save  register  for  live  range  x  in  Figure  6.3  (a)  incurs  a  high  cost;  keeping  x  in  a 
caller-save  register  can  only  make  the  result  of  register  allocation  worse  because  x  must  be 
saved  and  restored  once  for  each  use  of  x  in  the  loop.  The  two  memory  accesses  required  to 
save/restore  the  caller-save  register  are  more  than  the  single  restore  operation,  which  would 
be  required  if  x  had  been  spilled. 

Likewise,  in  Figure  6 .3  (b) ,  keeping  a:  in  a  callee-save  register  results  in  a  higher  register- 
allocation  overhead  than  spilling  because  saving  and  restoring  the  value  of  the  callee-save 
register  happens  on  the  most  frequently  executed  path. 

Our  algorithm  models  register  assignment  as  a  possible  improvement.  That  is,  we 
start  out  with  the  assumption  that  a  live  range  resides  in  memory  (is  spilled)  and  then  try 
to  determine  if  allocating  a  register  reduces  the  overhead  operations.  This  model  allows 
us  to  spill  live  ranges  to  reduce  the  overall  number  of  load/store  operations  even  though 
there  are  available  registers.  The  two  benefit  functions  give  us  a  good  indication  if  a 
live  range  Ir  is  a  worthwhile  candidate  for  register  residence.  If  Ir  gets  a  caller-save 
register  and  benefit.caller(lr)  <  0,  then  spilling  Ir  reduces  the  load/store  counts  by 
I  bene  f  it -caller  (Ir)  |. 

While  assigning  registers  to  live  ranges,  the  color-assignment  phase  makes  spilling 
decisions  right  away  for  the  live  ranges  that  are  supposed  to  get  caller-save  registers  (i.e., 
spill  when  bene  fit -call  cr{lr)  <  0).  There  are  two  ways  to  make  spilling  decisions  for 
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(a)  High  caller-save  cost  (b)  High  callee-save  cost 


Figure  6.3:  Storage-class  analysis. 

live  ranges  that  receive  callee-save  registers.  The  first  approach  models  callee-save  costs  as 
occuring  only  once  for  each  callee-save  register.  For  the  first  live  range  Ir  that  uses  a  given 
callee-save  register,  making  the  spilling  decision  then  is  similar  to  making  the  decision  to 
use  a  caller-save  register.  That  is,  if  bcncfit-callcc(lr)  <  0,  then  Ir  is  spilled  to  memory. 
For  a  live  range  that  is  not  the  first  user  of  a  callee-save  register,  the  live  range  can  use 
the  callee-save  register  for  free  (since  there  in  no  need  to  spill  the  register).  The  second 
approach  views  callee-save  costs  as  shared  by  all  live  ranges  that  share  a  callee-save  register. 
That  is,  the  first  live  range  to  use  the  register  does  not  pay  all  the  costs;  the  cost  is  spread  over 
all  users.  The  spilling  decisions  for  live  ranges  that  (potentially)  get  callee-save  registers 
are  not  finalized  until  the  color- assignment  phase  finishes.  At  that  time,  it  is  known  which 
live  ranges  may  use  a  specific  callee-save  register.  For  a  callee-save  register  r,  and  5{r) 
the  set  of  live  ranges  that  share  r,  if  YlireS{r)  spill -cost{lr)  <  callccj:o$t{r),  then  all  live 
ranges  of  ^(r)  are  spilled. 

Our  experimental  data  indicate  that  the  second  approach  performs  better  than  the  first 
one  for  some  SPEC92  programs,  for  others  it  makes  no  difference.  To  illustrate  why  this 
is  the  case,  consider  two  live  ranges  with  spill  cost  4000  that  share  the  same  callee-save 
register,  and  the  callee-save  cost  is  5000.  The  first  approach  does  not  assign  any  of  the  two 
live  ranges  to  the  callee-save  register  because  of  the  high  first-use  cost.  At  the  end,  spilling 
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the  two  live  ranges  introduces  8000  load/store  operations  (assume  that  they  also  have  high 
caller-save  cost,  or  all  caller-save  registers  are  taken  by  their  neighbors).  However,  the 
second  approach  assigns  the  two  live  ranges  to  the  register  —  a  decision  that  saves  3000 
load/store  operations  over  spilling. 

The  flow  of  the  color- assignment  phase  using  storage-class  analysis  is  depicted  in 
Figure  6.4.  This  phase  pops  live  ranges  out  of  C  and  assigns  a  register  to  each  of  them.  If  a 
live  range  Ir  prefers  a  callee-save  register  (benefit xallcc{lr)  >  bcnefitxaller(lr)),  the 
register  allocator  first  attempts  to  find  a  legal  callee-save  register.  If  a  callee-save  register 
is  found,  then  the  next  step,  choosing  a  caller-save  register,  does  nothing  because  Ir  is 
already  assigned  a  register  that  Ir  desires.  A  live  range  that  prefers  a  caller-save  register  is 
also  assigned  a  register  in  a  similar  fashion.  For  simplicity,  benefit.eallec(lr)  <  0  is  the 
heuristic  to  spill  a  live  range  that  receives  a  callee-save  register.  The  condition  benefit  >  0 
means  benefit jcallee(lr)  >  0  for  callee-save  registers  ,  and  benefit.caller{lr)  >  0  for 
caller-save  registers.  The  spill  step  merely  adds  a  live  range  with  benefit  <  0  to  the  spill 
pool  (5). 
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Figure  6.4:  Structure  of  the  color  assignment  using  storage-class  analysis. 
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6.3  Benefit-driven  simplification 


Simplification  removes  live  ranges  one  by  one  from  the  interference  graph  and  pushes  them 
onto  C.  The  reverse  ordering  in  which  live  ranges  are  pushed  onto  C  is  the  ordering  in 
which  they  are  assigned  colors.  Simplification  is  a  nice  and  easy  way  of  packing  live  ranges 
into  registers  but  there  is  no  cost  model,  and  therefore,  decisions  during  simplification 
may  turn  out  to  handicap  the  color-assignment  phase.  If  the  target  machine  has  only  one 
kind  of  register,  the  order  of  removing  already-unconstrained  live  ranges  does  not  have  any 
influence  on  the  final  register  overhead,  because  using  each  register  incurs  the  same  cost. 
But  for  a  machine  with  a  mixture  of  registers  of  different  costs,  the  position  of  a  live  range 
on  C  plays  an  important  role  in  reducing  the  register  overhead. 

Each  live  range  has  two  benefit  functions,  benefit  Miller  and  bene  fit  malice,  which 
determine  the  preferred  kind  of  register.  Live  ranges  on  top  of  C  have  a  higher  chance 
to  obtain  the  preferred  kind  of  register,  because  fewer  registers  are  already  taken  by  other 
live  ranges.  For  example.  Figure  6.5  (a)  depicts  an  interference  graph  consisting  of  3  live 
ranges.  All  prefer  using  a  callee-save  register.  There  are  only  two  callee-save  registers 
available.  One  live  range  must  use  a  caller-save  register.  With  N  =  3,  all  three  live  ranges 
are  unconstrained.  A  legal  color  stack  is  shown  in  Figure  6.5  (b)  (obtained  by  removing 
Iry,  Ir^,  and  then  Ir^).  The  top  two  live  ranges,  and  Ir^,  are  assigned  callee-save 
registers  during  the  color- assignment  phase  and  Iry  ends  up  in  the  caller-save  register. 
This  assignment  then  saves  3200  loadystore  operations  over  spilling.  However,  the  best 
ordering  during  simplification  is  Ir,.,  Iry,  and  then  Ir^,  so  that  Ir^  and  Iry  can  obtain  the  two 
callee-save  registers;  this  assignment  actually  saves  4100  load/store  operations. 

During  simplification,  if  there  is  more  than  one  unconstrained  live  range,  then  the  live 
range  that  has  the  smallest  priority  is  removed.  Now  we  can  view  C  as  a  priority-based 
color  stack;  the  higher  position  a  live  range  on  C,  the  more  likely  the  live  range  has  more 
freedom  with  regard  to  picking  registers.  Hence  register  allocation  based  on  this  color  stack 
is  now  similar  to  priority-based  coloring  [16].  In  other  words,  benefit- driven  simplification 
unifies  the  priority-based  approach  with  simplification-based  register  allocation. 

The  flow  of  simplification  used  by  Chaitin-style,  optimistic,  and  fusion-style  coloring 
to  determine  the  color  ordering  is  depicted  in  Figure  6.6.  Moreover,  Figure  4.10  and  Figure 
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benefit ^  caller  ^  IjOOO^lr, 
beneftsul^ee 


benefit _caller  =100 
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Figure  6.5:  Effect  of  simplification  order,  with  N  =  3  registers  (2  callee-save  and  1 
caller-save). 


Figure  6.6:  Structure  of  benefit-driven  simplification. 


6.6  demonstrate  that  the  same  simplification  framework  has  been  shared  by  various  register 
allocation  approaches.  In  the  color-assignment  phase,  cliques  no  longer  exist,  and  no  live- 
range  splitting  is  allowed  because  fusing  graphs  is  already  completed.  Hence,  the  task  of 
the  “lowering  pressure”  step  in  the  color- assignment  phase  is  quite  different  for  and  much 
simpler  than  that  in  Figure  4.10.  Here  the  lowering-pressure  step  simply  chooses  one  live 
range  based  on  a  heuristic  to  spill  and  relaxes  the  degrees  of  the  live  range’s  neighbors.  The 
removing-LSJ^  step  in  Figure  6.6  is  composed  of  choosing  one  unconstrained  live  range 
with  the  smallest  key  to  simplify  and  relaxing  the  degrees  of  the  live  range’s  neighbors. 
One  implementation  of  keeping  track  of  the  unconstrained  live  range  with  the  smallest  key 
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in  LS-N  is  implementing  LS-N  using  a  heap  structure  [19].  In  the  “relaxing  degree”  step, 
as  constrained  live  ranges  become  unconstrained,  the  live  ranges  are  inserted  into  the  LS-N 
heap  based  on  a  key  function,  benefit -caller  and  benefit -callce  form  the  foundation  of 
our  experiments  to  investigate  different  strategies  to  order  unconstrained  live  ranges  during 
simplification.  Specifically,  we  investigate  two  keys  to  order  live  ranges: 

•  m2Lx{bcnc  fit -caller  y  benefit  jzallcc) 

•  or 

(I  bene  fit -Caller  —  benefit  .call  cc  \  if  bene  fit -call  cr  >  0  and 

benefit jcallcc  >  0 

m20i(bcnc  fit -caller^  bcnefitjcallcc)  otherwise 

Priority-based  coloring  uses  this  strategy  to  prioritize  so  as  to  make  sure  that  the  live 
range  with  maximum  savings  has  the  highest  priority  to  occupy  a  register.  However, 
this  strategy,  when  implemented  as  part  of  benefit-driven  simplification,  increases  the 
register  overhead  for  some  SPEC92  programs,  compared  to  register  allocation  without 
using  benefit-driven  simplification.  This  strategy  is  not  suitable  for  Chaitin-style  register 
allocation  because  simplification  guarantees  that  all  live  ranges  on  C  can  find  registers; 
in  other  words,  making  sure  that  live  ranges  with  bigger  savings  own  registers  is  not  a 
concern  for  Chaitin-style  coloring.  What  Chaitin-style  coloring  cares  about  is  the  incurred 
penalty  of  using  the  wrong  kind  of  register  —  it  cares  more  about  the  delta  between  the 
two  benefit  functions  rather  than  the  maximum  of  the  two.  Therefore  the  second  strategy 
is  used.  The  delta  is  the  incurred  penalty  of  assigning  the  wrong  kind  of  register  to  a  live 
range.  If  bcncfit-callcr{lr)  and  bcncfitjcallcc{lr)  are  greater  than  or  equal  to  zero,  the 
key  function  is  the  difference  of  the  two  benefit  functions.  Otherwise,  the  maximum  of  the 
two  benefit  functions  is  used  as  the  key  function. 

The  example  in  Figure  6.7  illustrates  how  the  key  functions  make  a  difference  in  register 
allocation  overhead.  The  interference  graph  in  Figure  6.7  (a)  comprises  three  live  ranges, 
l^x,  l^y,  and  Ir^,  Using  the  first  definition,  Ir^  and  Iry  have  higher  priorities  (2000)  than 
Ir.^  (1500).  The  order  in  which  live  ranges  are  removed  from  the  graph  is  Ir,^,  Iry,  and  Ir^^, 
Because  all  three  live  ranges  prefer  a  callee-save  register,  Ir^  and  try  are  assigned  the  two 
callee-save  registers.  The  total  savings  amounts  to  4500  load/store  operations.  With  the 
second  definition,  Ir^r  has  a  higher  priority  than  and  Iry,  because  for  Ir.^  the  penalty  of 
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using  the  wrong  kind  of  register  is  higher  than  for  the  other  two.  Thus  Ir,  ends  up  on  top 
of  C.  The  second  definition  yields  a  better  allocation  than  the  first  one  (and  results  in  a  total 
savings  of  5300  load/store  operations). 


6.4  Preference  decision 

One  shortcoming  of  Chaitin-style  register  allocation  is  that  low-priority  (small-savings) 
live  ranges  with  high  degree  may  take  away  the  kind  of  register  that  high-priority  (big- 
savings)  live  ranges  crave  for.  Benefit-driven  simplification  tries  to  prevent  this  situation 
from  happening.  Nevertheless,  there  is  still  no  guarantee  that  live  ranges  with  high  priority 
get  the  registers  they  prefer,  because  simplification  is  sensitive  to  the  degree  of  live  ranges 
(a  live  range  can  be  removed  from  the  interference  graph  only  if  the  live  range’s  degree 
is  less  than  iV).  The  color- assignment  phase,  however,  examines  only  the  two  benefit 
functions,  benefit -call  cc  and  bene  fit -caller  j  to  determine  what  kind  of  register  a  live 
range  should  get  without  caring  about  the  needs  of  other  live  ranges.  This  restricted  view 
of  the  color- assignment  phase  is  the  cause  of  unnecessary  overhead  operations. 

Therefore,  prior  to  the  color- assignment  phase,  we  provide  a  separate  phase  that  pre¬ 
determines  the  preferable  kind  of  register  for  some  live  ranges.  The  purpose  of  this  phase  is 
to  minimize  the  caller-save  overhead.  This  phase  goes  through  each  function  call  in  order  of 
weighted  execution  frequency  and  makes  the  preference  decision  for  live  ranges  across  the 
function  call.  For  a  function  call,  if  the  number  (L)  of  the  live  ranges  that  contain  the  call 
and  prefer  callee-save  registers  is  less  than  or  equal  to  the  number  of  available  callee-save 


benefit jcaller  -  1,800 
benefU_caJlee  -  2,000 


y  j  benefit _caller  =  1,800 
benefit _callee ::  2,000 


benefit  ^caller  =  500 
heiiefit  jxiilee  1 ,500 
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l^x 


K 
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Figure  6.7:  Priority  with  N  ~  3  registers  (2  callee-save  and  1  caller- save). 
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Figure  6.8:  Preference  decision  example  with  N  =  3  registers  (1  callee-save  and  2  caller- 
save). 

registers  (M),  then  there  is  no  preference  decision  that  needs  to  be  made.  If  L  is  greater 
than  M,  then  regardless  of  how  registers  are  assigned,  at  least  L  —  Jil  live  ranges  must  use 
caller-save  registers  rather  than  callee-save  registers. 

We  sort  those  live  ranges  using  a  key  function.  The  least  L  -  M  live  ranges  are 
annotated  so  that  their  preferable  kind  of  register  is  caller- save  regardless  of  the  fact  that 
benefit ^callcc  >  benefit jzallcr.  After  registers  are  assigned  to  those  live  ranges,  just 
like  for  regular  live  ranges,  register  allocation  uses  bencfit^callec  and  benefit  ^caller  to 
analyze  if  they  should  reside  in  registers  or  memory.  The  key  is  defined  as: 

{caller -Cost  if  benefit -call  cr  >  0 
spill -cost  otherwise. 

caller -cost{lr)  is  the  overhead  incurred  for  Ir  using  caller-save  registers  which  is  equal  to 
spilLcost  -  bene  fit-caller.  For  those  live  ranges  with  bene  fit-caller  >  0,  caller -cost  is 
used  as  the  key  (ca/^cr-co5f  is  the  overhead).  For  the  live  ranges  whose  6cnc/zf-cct//cr  <  0, 
the  key  is  spilLcost  because  storage-class  analysis  spills  them  if  they  do  not  get  callee-save 
registers.  In  other  words,  spilLcost  is  the  incurred  penalty  for  not  using  a  callee-save 
register. 

Figure  6.8  (a)  depicts  an  interference  graph  with  five  live  ranges.  The  two  benefit 
functions  of  each  live  range  are  listed  in  the  table  of  Figure  6.8  (b).  Callee-save  registers 
are  the  precious  resources  in  this  example  (all  live  ranges  are  competing  for  callee-save 
registers).  The  key  values  of  live  ranges  which  are  used  in  benefit-driven  simplification  are 
Irt  (100),  Wy  (100),  Ir.  (100),  Ir^  (300),  and  Ir^,  (3900).  Only  can  be  removed  at  first 
given  N  =  3,  because  the  degrees  of  the  other  live  ranges  are  greater  than  or  equal  to  3.  Let 
the  subsequent  order  in  which  live  ranges  are  removed  from  the  graph  be  Ir^,,  Ir^  and 
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Itx>  The  ordering  in  which  live  ranges  are  assigned  colors  is  try,  Irt,  and  Ir^ 
then  takes  away  the  callee-save  register  and  leaves  no  callee-save  register  but  the  caller-save 
registers  for  Iru  Ivy  and  Ir^  to  use  {Ir,  gets  the  same  register  as  /rj.  This  assignment 
saves  900  load/store  operations  (300  and  600  for  using  caller-save  and  callee-save  registers, 
respectively).  If  live  ranges  Ir^  and  Ivu;  contain  the  same  (high-frequency)  function  call, 
then  It:^  is  forced  to  reside  in  a  caller-save  register  instead  of  the  callee-save  register.  This 
color  assignment  then  saves  4500  load/store  operations  und  Irt  get  the  caller-save 

registers,  and  Ir^j  and  Ivy  reside  in  the  callee-save  register). 


6.5  Splitting  to  reduce  call  cost 

Fusing  two  interference  graphs  tends  to  grow  live  ranges  aggressively  to  reduce  the  shuffle 
cost  as  long  as  the  colorability  invariant  is  maintained.  Storage-class  analysis,  benefit-driven 
simplification,  and  preference  decision  are  the  techniques  that  adjust  spill,  caller-save,  and 
callee-save  costs.  Caller-save  cost  may  be  reduced  at  the  expense  of  spill  cost.  Growing 
a  live  range  aggressively  could  potentially  include  more  call  sites  into  the  live  range. 
Consequently,  the  caller-save  cost  of  the  live  range  increases.  High  caller-save  cost  of  a  live 
range,  /r,  means  two  things  when  a  caller-save  register  is  assigned  to  Ir,  Ir  either  is  spilled 
or  pays  high  caller-save  cost. 

One  improvement  to  fusion-based  coloring  is  to  limit  the  growth  of  live  ranges  that 
cross  function  calls  by  suppressing  coalescing.  That  is,  the  technique  can  be  applied  only  to 
live-range  splitting  approaches.  The  improvement  is  called  call-cost  splitting  that  splits 
live  ranges  to  reduce  call  cost.  By  suppressing  the  growth  of  a  live  range  containing  calls, 
the  (smaller)  live  range  has  a  chance  to  get  the  best  kind  of  register  or  be  spilled  (at  lower 
spill-cost),  thus  reducing  the  overall  load/store  counts.  Even  though  the  graph  is  colorable, 
it  may  be  better  if  the  live  range  is  split  at  infrequently  executed  edges;  then  we  pay  the 
lower  shuffle  cost  (along  these  edges)  instead  of  the  higher  caller-save  cost  at  all  call  sites. 

The  unioning-splitable  step  (Section  4.2.5)  is  the  place  where  the  call-cost  splitting  takes 
place.  Prior  to  unioning  two  splitable  live  ranges,  Iri  and  Irz,  along  edge  E,  the  call-cost 
splitting  determines  if  suppressing  combining  Iri  and  lr2  on  edge  E  could  likely  reduce 
overhead  operations.  There  is  no  guarantee  that  splitting  on  edge  E  is  effective  because  the 
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final  result  is  unknown  till  the  color- assignment  phase. 

A  register  allocation  heuristic  should  take  every  aspect  of  the  cost  model  into  account, 
and  avoid  making  any  decision  based  on  part  of  the  cost  model.  Hence  we  discuss  call-cost 
splitting  from  the  viewpoints  of  shuffle,  spill,  and  call  costs. 


•  shuffle  and  spill  costs:  Splitting  In  and  In  on  edge  E  =  (Zli,  fJa)  induces  shuffle 
code  on  E.  The  shuffle  cost  on  E  should  be  cheap  relative  to  either  lr\  or  In¬ 
in  other  words,  one  of  the  live  ranges  can  afford  the  induced  shuffle  cost.  The 

condition  we  check  is /rcg(£)  >  spilLcost{ln)/a^  frcq{E)  >  spilLcoat{ln)la, 

where  frcq{  E)  is  the  estimated  execution  frequency  of  E  and  a  is  a  constant  that 
is  controlled  by  the  user  or  the  register  allocator,  "o  =  100"  means  that  one  live 
range’s  spill  cost  is  at  least  100  times  more  than  the  potentially  incurred  shuffle  cost. 
A  big  a  indicates  that  live  ranges  are  allowed  to  grow  larger  than  a  small  a— split  less 
frequently.  The  bigger  the  number  of  registers,  the  more  likely  the  color- assignment 
phase  assigns  the  preferable  kind  of  register  to  a  live  range.  Therefore,  call-cost 
splitting  should  happen  less  frequently  (use  a  big  a).  If  the  number  of  registers  is 
small,  live  ranges  are  competing  for  their  preferred  kind  of  registers.  Thus,  call-cost 
splitting  should  be  applied  more  frequently  (use  a  small  a). 

•  call  cost:  The  previous  condition  decides  if  i?  is  a  good  splitting  point.  We  want  to 
determine  if  coalescing  Iri  and  In  into  Irn  may  introduce  more  overhead  operations 
than  suppressing  In  and  In  provided  that  Iru  ends  up  in  a  caller-save  register.  Here 
we  look  for  the  case  that  one  live  range  (say  In)  has  hcncfitJMllcr  >  0  and  the  other 
live  range  (say  In)  has  benefit. caller  <  0,  because  we  ge\.hcnefUjMller{lrn)  < 
bcncfit.callcr{lrf]  if  coalescing  of  the  two  live  ranges  is  not  suppressed.  In  addition, 
we  check  the  condition  caller  j:os<(/r2)  >  2*frcq{E).  Splitting  Iri  and  Iri  prevents 
caller. cost{lrt)  from  being  added  to  caller xost{lri)  at  the  expense  of  shuffle  cost. 
Once  coalescing  of  Iri  and  lr2  is  suppressed,  not  only  is  shuffle  code  generated  at  the 
entry  to  Irz,  but  also  is  shuffle  code  potentially  introduced  at  the  exit  of  Irt  because 
Irt  is  a  stand  alone  live  range.  We  use  2*  frcq{E)  to  estimate  the  possible  incurred 
shuffle  code  operations. 


(a)  (b) 

Figure  6.9:  Eliminating  caller-save  cost  by  splitting. 


Figure  6.9  depicts  an  example  of  how  caller-save  cost  is  reduced  by  splitting.  In  this 
example,  the  live  range  x  is  live  through  out  the  whole  program.  Assume  lr(x)  is  not  split, 
i.e.  X  occupies  one  register  in  Figure  6.9  (a).  If  the  register  is  a  caller-save  register,  lr{x) 
pays  a  high  caller  cost  (saving  and  restoring  x  around  function  f  oo).  It  is  much  cheaper 
to  split  the  live  range  of  x  into  lr\{x)  and  lr-i{x)  along  the  edge  (i?2,  ^3)»  as  illustrated  by 
Figure  6.9  (b).  Now  lri{x)  can  reside  in  either  memory  or  a  callee-save  register  (if  the 
callee-save  cost  is  low). 


6.6  Evaluation  of  SC,  BS,  and  PR  enhancements 

In  this  section  we  report  on  an  empirical  evaluation.  We  measure  the  influence  of  differ¬ 
ent  combinations  of  the  improving  techniques  for  various  SPEC92  programs:  (alvinn, 
compress,  ear,  eqntott,  espresso,  gcc,  li,  sc,  doduc,  fpppp?  matrixSOO, 
nasa7,  spice,  and  tomcatv).  We  pay  special  attention  to  the  issue  of  deciding  between 
a  caller-save  and  a  callee-save  register  for  a  live  range.  The  y-axis  of  the  figures  in  this 
section  shows  the  register  overhead  produced  by  a  base  Chaitin-style  coloring  (with  the 
simple  cost  model)  divided  by  the  overhead  of  an  improved-version  Chaitin-style  register 
allocator  using  various  combinations  of  the  enhancements  (storage-class  analysis,  benefit- 
driven  simplification,  and  preference  decision).  The  bigger  this  number,  the  less  overhead 
is  there  in  the  improved  version.  SC,  BS,  and  PR  stand  for  Storage-Class  analysis.  Benefit- 
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driven  Simplification,  and  PReference,  respectively.  We  can  classify  the  SPEC92  programs 
(compiled  using  profile  information)  into  4  classes. 


•  Each  optimization  contributes  a  significant  fraction  of  improvement.  Examples  are 
nasaV  and  ear  shown  in  Figure  6.10.  For  nasaV  and  ear,  there  is  not  much  room 
for  optimizations  to  reduce  overhead  operations  if  only  a  smaller  number  of  registers 
is  available.  With  more  registers,  the  optimizations  have  more  freedom  to  choose  the 
right  kinds  of  storage  class  (caller-save  registers,  callee-save  registers,  or  memory) 
for  live  ranges. 


nasa7 


ear 


Figure  6.10:  SC,  BS  and  PR  are  effective. 

•  Only  storage-class  analysis  has  a  dramatic  improvement;  spilling  live  ranges  that  have 
the  wrong  kind  of  registers  helps  reducing  the  overall  overhead  operations.  Examples 
are  li,  sc,  andmatrixSOO.  See  Figure  6.11. 

•  Pre-determining  the  preferred  kind  of  registers  for  live  ranges  does  not  affect  the 
overall  number  of  overhead  operations.  Examples  of  this  class  are  compress, 
eqntott ,  espresso,  gcc,  fpppp,  doduc,  and  spice  (shown  in  Figure  6.12 
and  Figure  6.13). 

•  For  programs  that  have  low  callee-save  and  caller-save  costs,  such  as  tomcatv, 
which  consists  of  only  one  big  function  and  no  calls,  none  of  the  three  techniques 
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Figure  6.11;  Only  SC  is  effective. 
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Figure  6.13:  SC  and  BS  are  effective  (cont.). 


makes  any  difference.  All  the  ratios  of  Base-Chaitin/Improved- version  for  the  three 
improvements  are  1.0  (Figure  6.14). 


Figure  6.14:  No  improvement  for  tomcatv. 


Figure  6.15  shows  the  effect  of  these  improvements  on  the  programs  from  Figure 
6.2.  This  figure  depicts  the  register  overhead  for  ear  and  eqntott  with  the  three 
improvements.  For  the  ear  and  eqntott  programs,  the  improved Chaitin-style  coloring 
reduces  the  register  overhead  by  a  factor  of  45  and  66  (i.e,  with  the  base  register  allocator, 
there  are  45  (66)  times  as  many  overhead  operations  as  required  by  the  improved  Chaitin- 
style  coloring). 

I  also  evaluate  how  storage-class  analysis,  benefit-driven  simplification,  and  preference 
decision  reduce  register  overhead  operations  using  static  information.  Again  static  informa- 
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Figure  6.15;  Register  overhead  for  improved  register  allocation. 


tion  uses  the  loop  hierarchy  (nesting  levels)  as  an  approximation  of  the  execution  frequency 
of  a  basic  block.  The  results  are  shown  in  Figures  6.16,  6.17,  and  6.18.  Figures  6.17  and 
6.18  show  those  programs  that  exhibit  the  same  trend  of  improvement  as  using  dynamic 
information.  The  three  improvements,  rather  than  ameliorate,  deteriorate  the  quality  of 
register  allocation  (generate  more  operations  than  the  base  Chaitin)  for  the  programs  in 
Figure  6.16. 

The  three  improvements  are  based  on  the  analysis  of  spill  cost,  bcncfitxaller  and 
bcncfit-callcc  which  solely  depend  on  the  accuracy  of  the  estimated  execution  frequency. 
Using  the  loop  hierarchy  to  estimate  the  execution  frequency  treats  every  block  within  the 
same  loop  as  having  the  same  execution  frequency.  In  practice,  this  assumption  is  not 
always  true.  When  the  static  information  does  not  capture  the  actual  execution  pattern  of 
a  program,  the  techniques  may  worsen  the  register-allocation  result.  For  example,  the  live 
range  Ir^  in  Figure  6.19  crosses  the  function  call  of  f  oo  ( ) ,  which  is  in  the  loop  (DA). 
Based  on  the  static  information,  the  loop  is  considered  to  be  more  frequently  executed  than 
the  rest  of  the  blocks.  bcncfitxallce{lrx),  therefore,  is  greater  than  bcncfitxallcrilr,^. 
The  color- assignment  phase,  therefore,  attempts  to  put  Ir^  into  a  callee-save  register  at  first. 
There  are  two  outcomes  for  hs  of  the  color-assignment  phase:  (1)  Ir^  obtains  a  callee-save 
register  or,  (2)  Ir^  gets  a  caller-save  register  and  then  is  spilled  to  memory  by  the  storage- 
class  analysis  due  to  the  high  caller-save  cost.  If,  in  fact,  the  shaded  region  depicts  the  most 
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do  due 


fpppp 


Figure  6.17:  Similar  trend  of  improvement  as  the  dynamic  case, 
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tomcatv 


Figure  6.18:  Similar  trend  of  improvement  as  the  dynamic  case  (cont.). 
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frequently  executed  path  instead,  then  case  (1)  pays  high  callee-save  costs,  and  case  (2) 
incurs  high  spill  costs.  The  register  overheads  of  compress,  sc,  gcc,  li,  and  spice 
(Figure  6.16)  are  hurt  due  to  this  reason.  There  is  one  noticeable  aspect.  When  the  number 
of  registers  is  small,  the  more  improvements  are  added  into  register  allocation,  the  worse  the 
register  allocation  result.  As  the  number  of  registers  increases,  we  see  that  the  integration 
of  all  three  improvements  can  make  up  the  inaccuracy  of  the  static  estimated  execution 
frequency.  That  is,  the  result  of  SC+BS+PR  is  better  than  SC.  If  the  static  information 
gives  us  a  good  indication  about  the  actual  execution  pattern  of  a  program,  we  see  a  similar 
improvement  trend  as  using  dynamic  information  (as  shown  in  Figures  6.17  and  6.18). 
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Figure  6.19:  Dynamic  versus  static. 


6.7  Evaluation  of  call-cost  splitting 

Given  the  importance  of  getting  the  right  kind  of  register  in  the  previous  experiment,  we  take 
a  positive  action  to  limit  the  growth  of  live  ranges,  as  described  in  Section  6.5.  Because  call- 
cost  splitting  can  only  be  applied  in  fusion-style  coloring,  the  base  with  which  we  compare 
this  optimization  is  fusion-style  coloring  without  call-cost  splitting  instead  of  Chaitin-style 
coloring.  Fusion-style  coloring  with  and  without  call-cost  splitting  all  perform  storage-class 
analysis,  benefit-driven  simplification,  and  preference  decision.  Figure  6.20  presents  the 
results  for  alvinn,  nasa7  and  eqntott  for  call-cost  splitting.  We  run  the  experiment 
using  static  and  dynamic  information,  alvinn  and  nasa7  use  all  registers;  eqntott 
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Region  Region+  Region+prfl  Regian+ 
CCS  prfl+CCS 

alvinn 


nasa7 


eqntott 


Figure  6.20:  Impact  of  call-cost  splitting  (CCS). 


uses  14  integer  and  14  floating-point  registers. 

Recall  that  a  live  range  that  is  assigned  a  caller-save  register  but  has  high  caller-save 
cost  and  low  spill  cost  is  spilled  to  memory  during  the  color- assignment  phase,  even  though 
there  are  enough  registers  to  hold  the  live  range.  Once  a  live  range  is  picked  to  be  spilled  at 
this  stage,  no  splitting  is  attempted.  In  other  words,  all  references  go  through  memory.  The 
effect  of  such  a  decision  is  to  increase  the  amount  of  spill  code  to  reduce  loads  and  stores 
of  the  caller-saved  registers.  The  spill  code  that  we  see  in  Figure  6.20  (i.e.,  for  alvinn 
and  eqntott)  is  due  to  such  live  ranges  that  we  spilled  during  color  assignment.  The 
call-cost  splitting  suppresses  coalescing  of  two  live  range  segments  lr\  and  lr2  if  one  of 
them  (say  Irz)  has  high  caller-save  cost.  As  a  result,  the  spill  cost  of  those  live  ranges  stays 
low.  If  those  live  ranges  get  caller-save  registers,  they  would  probably  be  spilled  to  memory 
during  color  assignment  because  of  a  high  caller-save  cost  and  a  low  spill  cost.  Without 
coalescing,  Iri  (with  low  caller-save  cost)  gets  the  desired  caller-save  register;  Ivi  ends 
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up  in  a  memory  or  in  a  callee-saved  register.  If  Iri  is  in  memory,  extra  shuffle  loads  and 
stores  are  required.  If  Irz  is  in  a  callee-save  register  instead,  no  shuffle  loads  and  stores  are 
needed,  but  shuffle  moves  must  be  inserted.  However,  on  most  modem  machines,  moves 
are  cheaper  than  loads/stores. 

The  results  shown  in  Figure  6.20  illustrate  the  benefits  nicely.  The  call-cost  splitting 
succeeds  in  isolating  high  caller-save  cost  regions  at  less  frequently  executed  edges.  Con¬ 
sequently,  the  color-assignment  phase  spills  fewer  live  ranges  to  memory.  In  the  case  of 
alvinn,  about  half  of  all  overhead  operations  are  shuffle  moves.  For  egntott,  there  are 
more  shuffle  moves  than  shuffle  loads  or  stores. 

Overall,  in  the  case  of  alvinn,  the  fusion-based  approach  of  dealing  with  spilling  and 
splitting,  including  call  cost  optimization,  reduces  total  data  movement  overhead  by  80  % 
compared  to  a  Chaitin-style  allocator. 


6.8  Summary 

In  this  Chapter,  I  have  presented  and  evaluated  the  four  enhancements,  storage-class  anal¬ 
ysis,  benefit-driven  simplification,  preference  decision,  and  call-cost  splitting.  The  ex¬ 
perimental  results  show  that  the  call  cost  dominates  the  register  overhead  as  the  number 
of  register  increases,  because  the  register  allocator  is  able  to  assign  registers  to  most  live 
ranges.  Ignoring  the  call  cost  may  lead  the  register  allocator  to  choose  the  wrong  kind  of 
register  for  live  ranges  and  introduce,  as  a  result,  more  overhead  operations. 

Chaitin-style  coloring  with  all  the  enhancements  (of  Sections  6.2  -  6.4)  combines 
the  benefits  of  Chaitin-style  allocation  (pack  more  live  ranges  into  registers)  with  the 
advantages  of  priority-based  coloring  (high  priority  live  ranges  have  more  freedom  in 
choosing  registers).  The  experimental  data  show  that  the  improvement  over  ordinary 
Chaitin-style  coloring  can  be  up  to  a  factor  of  about  50  in  the  number  of  overhead  operations 
(55  for  ear  and  66  for  eqntott). 

Moreover,  I  have  presented  and  evaluated  a  new  technique,  call-cost  splitting,  which 
constrains  the  growth  of  live  ranges  during  graph  fusion  to  reduce  caller-save  cost. 


Chapter  7 

Comparison  of  Various  Approaches 


Chapter  6  describes  three  enhancements,  storage-class  analysis,  benefit-driven  simplifica¬ 
tion,  and  preference  decision.  Experimental  results  have  shown  that  the  integration  of  the 
three  enhancements  provides  an  effective  call-cost  model  to  reduce  call  cost  relative  to 
the  base  Chaitin-style  allocator.  It  is  important  to  evaluate  the  enhanced  model  with  other 
existing  model  that  is  adopted  by  prior  researchers  to  see  if  the  enhanced  model  is  practical 
or  not.  Various  register  allocation  approaches  share  the  same  register  allocation  framework, 
e.g.,  Chaitin-style,  optimistic,  priority-based,  and  fusion-based  coloring.  The  framework 
allows  us  to  compare  as  well  as  integrate  the  enhancements  with  those  approaches.  It  is 
interesting  to  see  if  the  enhancements  are  effective  in  approaches  other  than  Chaitin-style. 

In  this  chapter,  I  compare  various  different  register  allocation  approaches  and  discuss 
their  strengths  and  weaknesses.  In  Section  7.1, 1  contrast  optimistic  coloring  with  Chaitin- 
style  coloring.  In  Section  7.2, 1  discuss  the  similarity  between  priority-based  and  Chaitin- 
style  using  the  three  improvements  of  Chapter  6.  In  Section  7.3,  I  evaluate  a  register 
allocator  using  another  call-cost  model  that  has  been  adopted  by  prior  researchers. 


7.1  Optimistic  versus  non-optimistic 

Optimistic  coloring  [9]  delays  spilling  decisions  of  live  ranges  until  the  live  ranges  actually 
fail  to  find  legal  colors.  Optimistic  coloring  aggressively  tries  to  find  colors  for  those  oth¬ 
erwise  spilled  live  ranges.  If  we  exclude  call  cost  from  the  overhead  operations,  optimistic 
coloring  guarantees  to  deliver  a  result  at  least  as  good  as  Chaitin-style  coloring.  If  none 
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Figure  7.1:  Optimistic  coloring  with  N=2  (1  callee-save  and  1  caller-save). 


of  the  live  ranges  spilled  by  Chaitin-style  coloring  gets  a  color,  then  the  register  overhead 
is  the  same  as  that  of  using  Chaitin-style  coloring.  If  some  otherwise  spilled  live  ranges 
are  assigned  colors,  then  optimistic  coloring  produces  a  superior  result.  However,  if  we 
use  the  actual  cost  model  from  Section  2.1,  which  includes  the  call  cost,  trying  to  squeeze 
more  live  ranges  into  registers  may  not  be  a  good  idea  because  the  call  cost  of  keeping 
live  ranges  in  registers  may  be  higher  than  the  live  ranges  spill  cost.  Figure  7. 1  presents  a 
situation  where  optimistic  coloring  generates  more  register  overhead  operations.  Given  two 
registers,  one  callee-save  and  one  caller-save,  simplification  blocks  because  the  degree  of 
all  live  ranges  is  equal  to  2.  Optimistic  coloring  pushes  all  live  ranges  onto  C  and  is  able  to 
assign  legitimate  colors  for  them.  What  may  happen  is  that  Ir^  and  Ir,  use  the  callee-save 
register,  and  Ir.^  and  hy  use  the  caller-save  register.  In  this  case,  the  high  caller-save  cost 
of  Ir^  causes  an  inferior  result. 

Table  7.1  compares  the  overhead  of  optimistic  and  base  Chaitin-style  coloring  using 
static  and  dynamic  information  to  estimate  execution  frequencies.  The  value  in  each  entry 
is  the  quotient  Base-Chaitin/Optimistic.  The  darkly  shaded  regions  (<  1.00)  highlight  the 
cases  where  optimistic  coloring  results  in  more  overhead  operations  than  base  Chaitin-style 
coloring.  The  lightly  shaded  regions  (>  1.00)  indicate  that  optimistic  coloring  outperforms 
base  Chaitin-style  coloring.  The  (blank)  rest  (=  1.00)  indicates  that  optimisUc  coloring  has 
no  influence.  Surprisingly,  optimistic  coloring  does  not  improve  overhead  operations  for 
most  of  the  cases  and,  in  addition,  deteriorates  the  result  of  register  allocation  more  often  than 
ameliorates  it.  There  are  a  couple  of  reasons  for  that:  (1)  if  the  register  allocator  assigns 
registers  to  the  live  ranges  that  have  high  spill  cost,  assigning  registers  to  the  otherwise 
spilled  live  ranges  that  have  low  spill  cost  makes  only  a  small  difference.  Therefore  we 
notice  only  a  small  improvement.  (2)  If  the  live  ranges  spilled  by  the  base  register  allocator 
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end  up  in  the  wrong  kind  of  registers,  the  overall  effect  is  negative.  Optimistic  coloring 
is  sometimes  effective  for  a  small  number  of  registers  because  the  smaller  the  number  of 
registers  is,  the  more  live  ranges  are  spilled.  Yet  even  in  the  best  cases,  the  influence  of 
optimistic  coloring  is  small  (within  ±  6  %)  except  for  f  pppp  when  using  static  information 
(up  to  36  %  improvement  in  the  number  of  overhead  operations  with  a  small  number  of 
registers). 


Figure  7.2:  Optimistic  versus  non-optimistic  for  f  pppp  (using  static  information). 


I  incorporate  optimistic  coloring  into  improved  Chaitin-style  coloring  (i.e.,  includ¬ 
ing  storage-class  analysis,  benefit-driven  simplification,  and  preference  decision  enhance¬ 
ments).  Except  for  f  pppp  when  using  static  information,  the  results  are  almost  identical  to 
those  obtained  by  improved  Chaitin-style  coloring  alone.  This  result  is  no  surprise  because 
the  improvement  of  optimistic  coloring  is  small,  and  the  storage-class  analysis  spills  the  live 
ranges  that  are  not  worthwhile  residing  in  registers;  this  optimization  may  actually  undo  the 
color  assignment  done  by  optimistic  coloring.  For  instance,  consider  the  example  of  Figure 
7.1  again.  The  storage-class  analysis  checks  the  estimated  saving  (benefit)  of  keeping  alive 
range  in  a  register  and  spills  the  live  range  to  memory  if  the  saving  is  negative.  Ir^  is  then 
spilled  to  memory  in  the  color  assignment  phase.  The  result  is  the  same  as  Chaitin-style 
coloring  that  spills  (r,  when  simplification  blocks.  Figure  7.2  shows  in  more  detail  the 
improvements  due  to  optimistic  coloring,  improved  Chaitin-style  coloring,  and  improved 
Chaitin-style  with  optimistic  coloring  for  f  pppp  using  static  information.  Optimistic  col- 
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Table  7.1;  Optimistic  coloring  versus  Chaitin-style. 


12.  PRIORITY-BASED  VS.  CHAITIN-STYLE 

oring  performs  well  with  a  small  number  of  registers,  but  the  enhancement  drops  as  the 
number  of  registers  increases.  Improved  Chaitin-style  coloring,  on  the  other  hand,  performs 
weil  as  the  number  of  registers  increases,  because  the  more  registers  are  available,  the  more 
freedom  the  register  allocator  has  to  choose  between  caller-save  and  callee-save  registers. 
We  see  the  improvements  of  optimistic  coloring  plus  improved  Chaitin-style  coloring  as  the 
two  are  integrated.  When  the  number  of  registers  is  small,  improved  Chaitin-style  coloring 
is  not  very  effective,  so  we  see  only  the  contribution  of  optimistic  coloring.  As  the  number 
of  registers  increases,  optimistic  coloring  has  less  influence  and  improved  Chaitin-style 
coloring  picks  up,  so  we  see  the  enhancement  due  to  improved  Chaitin-style  coloring. 


7.2  Priority-based  vs.  Chaitin-style 

The  use  of  a  benefit  function  to  sort  live  ranges  invites  a  comparison  with  pnority-based 
coioring  [16].  At  first  sight,  Chaitin-style  coloring  and  priority-based  coloring  appear  to 
have  little  in  common.  Priority-based  coloring  assigns  registers  to  iive  ranges  based  on  a 
priority  function  and  splits  a  live  range  Ir  when  no  legal  register  exists  for  Ir. 

Chaitin-style  and  priority-based  coloring  share  a  major  core  if  priority-based  coloring 
simply  spills  rather  than  splits  live  ranges  when  it  runs  out  of  colors.  That  is,  both 
approaches  try  to  determine  the  color  ordering  in  which  live  ranges  are  assigned  colors 
(registers).  Chaitin-style  coloring  simplifies  the  interference  graph  to  decide  the  ordering 
in  which  live  ranges  are  assigned  colors.  The  color-assignment  phase  then  guarantees  to 
assign  registers  for  ali  non-spilled  live  ranges  based  on  the  ordering.  Priority-based  coloring 
uses  cost  analysis  as  the  priority  to  determine  the  ordering  and  guarantees  that  the  color- 
assignment  phase  assigns  registers  to  the  most  important  (high  savings  of  memory  accesses) 
live  ranges.  The  two  approaches  aim  at  two  different  directions:  (1)  Chaitin-style  coloring 
finds  an  ordering  that  packs  live  ranges  into  registers — ^potentially  it  uses  fewer  colors 
than  priority-based  coloring,  and  (2)  priority-based  coloring  wants  to  make  sure  that  the 
most  important  live  ranges  are  in  registers  even  though  it  may  require  more  colors  (or  spill 
more  unimportant  live  ranges).  Both  approaches  have  their  own  strengths  and  weaknesses. 
Chaitin-style  coloring  with  all  three  enhancements  is  a  hybrid  of  the  two  approaches  that 
uses  Chaitin-style  coloring  as  the  framework  for  packing  live  ranges  and  uses  storage-class 
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analysis  and  benefit-driven  simplification  to  achieve  the  same  effect  as  the  priority-based 
approach. 

7.2.1  Priority  functions 

Reducing  register  overhead  operations  is  the  primary  goal  of  the  priority  function.  Hence, 
the  more  memory  accesses  saved  by  assigning  a  register  to  a  live  range,  the  higher  the 
priority  of  the  live  ranges  is.  The  bigger  a  live  range,  the  more  conflicting  live  ranges 
cannot  use  the  same  register  as  the  live  range.  That  is,  higher  register  pressure  is  introduced 
if  bigger  live  ranges  reside  in  registers.  Higher  register  pressure  is  more  likely  to  cause 
overhead  operations.  Therefore,  the  bigger  a  live  range,  the  lower  the  priority  of  the  live 
range  is.  Based  on  the  two  conditions,  we  use  gs  the 

priority  function  for  priority-based  coloring.  sizc{lr)  is  the  number  of  basic  blocks  that  Ir 
contains.  This  priority  function  is  the  same  as  the  one  used  in  [16].  The  spilling  heuristic 
of  Chaitin-style  coloring  and  optimistic  coloring  is  The  heuristic  tends  to  spill 

live  ranges  that  have  lo\v  spill  cost  and  high  degree.  The  priority  function  and  the  spilling 
heuristic  are  alike.  The  numerator  of  the  priority  function  reflects  the  spill  cost  because  the 
bigger  the  spill  cost,  the  bigger  the  two  benefit  functions  are.  The  denominator  reflects  the 
degree  of  live  ranges  because  the  bigger  the  live  ranges,  the  more  likely  the  live  ranges  have 
higher  degree. 

There  are  a  few  ways  to  determine  the  color  ordering  for  priority-based  coloring: 

•  removing  unconstrained:  Unconstrained  live  ranges  are  removed  from  the  interfer¬ 
ence  graph  and  pushed  onto  C,  The  remaining  live  ranges  are  pushed  onto  C  from  the 
least  priority  to  the  highest  priority.  Chow  uses  the  same  approach  in  [16]. 

•  sorting  unconstrained:  Unconstrained  live  ranges  are  not  pushed  onto  C  in  a  priority 
fashion  if  we  simply  remove  unconstrained  live  ranges.  That  is,  the  kind  of  regis¬ 
ter  that  a  higher-priority  live  range  wants  is  possibly  taken  away  by  lower-priority 
live  ranges.  Sorting  unconstrained  live  ranges  alleviates  the  problem  by  pushing 
unconstrained  live  ranges  onto  C  also  in  a  priority  manner. 

•  sorting:  All  live  ranges  are  purely  sorted  in  terms  of  their  priorities.  The  sorting 
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approach  guarantees  that  the  higher  the  priority,  the  more  likely  that  live  ranges  are 
assigned  their  preferred  kind  of  register. 

All  three  heuristics  produce  nearly  identical  register  overhead  (±  10%)  for  most  cases. 
A  detailed  comparison  is  given  in  Tables  7.2  and  7.3.  RU,  SU,  and  S  indicate  Removing 
Unconstrained,  Sorting  Unconstrained,  and  Sorting,  respectively.  There  are  two  rows  for 
each  SPEC92  program,  RU/SU  and  RU/S.  The  entry  is  the  ratio  of  the  register  overhead  of 
the  two  heuristics,  e.g.,  1.04  for  RU/SU  means  that  the  RU  heuristic  introduces  4%  more 
overhead  operations  than  the  SU  heuristic.  The  empty  entries  in  the  tables  indicate  the 
two  heuristics  generate  the  same  register  overhead  (ratio  is  1.00).  The  experimental  results 
show  that  the  RU  and  SU  heuristics  give  identical  results  except  in  a  few  cases.  S(orting 
approach)  yields  much  better  results  (less  overhead)  for  ear  (both  static  and  dynamic)  and 
espresso  (dynamic). 

Figure  7.3  shows  an  example  where  S  yields  a  better  result  than  RU  and  SU.  Given  N=2 
(1  caUee-save  and  1  caller-save),  Ir,  is  the  only  unconstrained  live  range  (degree  =  1).  RU 
and  SU  then  remove  and  assign  colors  to  the  rest  live  ranges  in  the  priority  order, 

Ir,,  Iru;  and  Ir^.  The  callee-save  register  is  assigned  to  because  Ir^  is  considered  the 
highest  priority  live  range.  As  a  result,  Ir^  can  reside  only  in  the  caller-save  register.  The 
RU  and  SU  approaches  save  5900  load/store  operations.  S  treats  as  the  highest  priority 
live  range  and  assigns  the  caller-save  register  to  lr„.  This  assignment  saves  8500  load/store 
operations. 
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Figure  7.3:  Priority  functions  with  N=2  (1  callee-save  and  1  caller-save). 
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7.2.2  Evaluation 

We  compare  improved  Chaitin-style  coloring  with  priority-based  coloring  using  the  sorting 
heuristic.  We  can  classify  the  SPEC92  programs  based  on  the  results  into  3  classes.  Figures 
7.4, 7.5, 7.6,  and  7.7  show  the  results. 

•  Both  priority-based  and  improved  Chaitin-style  coloring  are  doing  equally  well. 
Examples  arealvinn,  eqntott,  gcc,  and  11  (Figure7.4).  For  alvinn,  packing 
live  ranges  is  important  for  small  numbers  of  registers.  When  the  number  of  registers 
increases,  packing  live  ranges  becomes  less  important  (because  high  spill-cost  live 
ranges  are  most  likely  to  findregisters)  so  we  see  that  the  two  approaches  yield  similar 
results. 

•  Improved  Chaitin-style  coloring  is  superior  to  priority-based  coloring,  compress, 
ear,  sc,  doduc,  nasa7,  spice,  and  tomcatv  fall  into  this  category  (Figures 
7.5  and  7.6).  For  nasa7,  improved  Chaitin-style  coloring  produces  slightly  fewer 
overhead  operations  than  priority-based  coloring  when  the  number  of  registers  is 
small  When  the  two  ^preaches  have  more  registers,  the  results  start  to  diverge. 
Chaitin-style  coloring  produces  a  similar  improvement  trend  in  both  the  dynamic 
and  static  cases.  Priority-based  coloring,  nevertheless,  does  not  improve  overhead 
operations  a  lot  over  the  base  in  the  static  case.  In  this  case,  priority-based  coloring 
introduces  4  times  more  overhead  operations  than  improved  Chaitin-style  coloring. 
The  same  scenario  happens  to  ear  as  well.  For  tomcatv,  the  priority-based 
approach  spills  more  live  ranges  than  the  improved  Chaitin-style  approach,  both  in 
the  static  and  dynamic  cases,  due  to  the  inability  of  packing  live  ranges  densely. 
For  sc,  improved  Chaitin-style  coloring  yields  bigger  improvement  than  priority- 
based  coloring  in  the  dynamic  case.  We,  therefore,  classify  it  into  this  class  although 
priority-based  coloring  using  static  information  has  slightly  fewer  overhead  operations 
than  improved  Chaitin-style  coloring. 

•  There  is  no  clear  winner  between  the  two  approaches.  Examples  of  this  class  are 
espresso,  fpppp  andmatrixSOO  (Figure 7.7).  For  espresso, priority-based 
coloring  is  clearly  superior  to  improved  Chaitin-style  coloring  in  the  dynamic  case 
but  not  in  the  static  case.  If  we  compare  the  results  for  fpppp  (static)  (Figure  7.7) 
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Table  7.2:  Comparison  of  priority-based  heuristics  (using  static  information), 
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Table  7.3:  Comparison  of  priority-based  heuristics  (using  dynamic  information) 


Base-Chaitin/ic  Base-Chaitinyic 
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Figure  7.4:  Priority-based  «  Chaitin-style. 
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Figure  7.6;  Priority-based  <  Chaitin-style  (cont.). 


with  the  results  shown  in  Figure  7.2,  we  notice  a  similarity  between  priority-based 
coloring  and  the  integration  of  improved  Chaitin-style  and  optimistic  coloring.  When 
the  number  of  registers  is  small,  we  see  the  trend  of  optimistic  coloring.  As  the 
number  of  registers  increases,  we  see  the  trend  of  improved  Chaitin-style  coloring. 


From  the  experimental  results,  improved  Chaitin-style  coloring  is  superior  to  priority- 
based  coloring  in  two  ways:  (1)  Chaitin-style  coloring  is  able  to  pack  more  live  ranges  into 
a  set  of  registers,  which  potentially  introduces  fewer  spill  code  operations,  (2)  the  priority 
function  used  in  priority-based  coloring  may  cause  low  spill-cost  live  ranges  to  take  away 
the  preferable  registers  of  high  spill-cost  live  ranges. 

The  example  in  Figure  7.8  illustrates  how  the  second  case  happens  with  N=3  (1  callee- 
save  and  2  caller-save).  The  two  benefit  functions,  benefit. caller  and  bcncfit-callcc,  of 
each  live  range  are  shown  in  the  table.  Priority-based  coloring  considers  Iry  to  have  the 
highest  priority  because  of  high  spill  cost.  Subsequently,  the  callee-save  register  is  assigned 
to  Iry,  which  leaves  the  two  caller-save  registers  for  /r„,.,  Ir^  and  Ir..  This  color  assignment 
results  in  a  bad  register  assignment  because  the  penalty  of  picking  the  wrong  kind  of  register 
for  Ivw  is  high.  The  assignment  ends  up  with  saving  8300  memory  accesses.  Benefit-driven 
simplification,  on  the  other  hand,  can  first  simplifies  either  or  Ir^.  Consequently,  all 
remaining  nodes  are  unconstrained  (degree  =  2).  Benefit-driven  simplification  using  the 
second  strategy  (Section  6.3)  puts  Ir^,  on  the  top  of  the  color  stack  because  /r„,.  has  the 
highest  penalty  of  picking  the  wrong  kind  of  register.  As  a  result,  improved  Chaitin-style 
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coloring  saves  1 1000  memory  accesses.  This  is  the  reason  why  priority-based  coloring 
yields  much  more  overhead  operations  for  nasa7  and  ear  in  the  static  case. 
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Figure  7.8:  Priority-based  coloring  with  N=3  (1  callee-save  and  2  caller-save). 


7.3  Chaitin-style  versus  CBH 

In  this  section,  we  compare  improved  Chaitin-style  coloring  with  another  base  model  that  is 
implemented  in  various  compilers.  This  model  extends  Chaitin-style  coloring  to  explicitly 
model  the  calling  convention.  According  to  Briggs  [8],  this  model  also  extends  hierarchical 
coloring  (as  described  for  the  Tera  compiler  [12]).  Based  on  the  main  components,  we  name 
this  cost  model  “CBH”  (for  Chaitin/Briggs-Hierarchical  coloring).  Although  the  CBH  cost 
model  has  been  adopted  by  prior  researchers,  there  is  no  published  report  of  the  model’s 
effectiveness. 

The  comparison  of  this  section  serves  three  purposes.  First,  it  allows  us  to  know  if 
our  base  Chaitin-style  coloring  is  a  reasonable  base  or  not,  relative  to  a  model  adopted  by 
other  compilers.  Second,  the  comparison  presents  the  effectiveness  of  the  CBH  cost  model. 
Third,  the  comparison  also  shows  if  the  call-cost  directed  Chaitin-style  model  outperforms 
CBH. 

In  the  CBH  approach,  live  ranges  that  cross  calls  interfere  with  all  caller-save  registers. 
In  other  words,  none  of  the  caller-save  registers  can  be  assigned  to  live  ranges  that  contain 
calls.  A  live  range  is  introduced  for  each  callee-save  register  extending  from  the  entrance 
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of  a  routine  to  the  exit  of  the  routine.  The  live  range  is  called  a  callee-save- register  live 
range.  Each  callee-save-register  live  range  has  two  references,  one  at  the  entry  and  the 
other  at  the  exit  (these  represent  save/restore  of  the  callee-save  register  at  the  entry/exit). 
With  the  execution  frequency  of  the  routine,  we  can  get  the  spill  cost  for  the  live  range. 
The  call  cost  is  then  modeled  by  CBH  just  like  regular  spill  cost.  Register  allocation 
performs  simplification  to  determine  if  it  is  worthwhile  using  a  callee-save  register.  When 
simplification  blocks,  one  live  range  with  the  least  spill  cost  is  chosen  from  the  remaining 
live  ranges,  including  the  callee-save-register  live  ranges.  Choosing  a  callee-save-register 
live  range  as  opposed  to  a  regular  live  range  to  spill  means  spilling  each  remaining  ordinary 
live  range  is  more  expensive  than  saving/restoring  the  callee-save  register  at  the  entry/exit. 
Once  a  callee-save  register  is  spilled,  the  callee-save  register  can  then  be  assigned  to  live 
ranges.  This  is  similar  to  the  way  Chow  handles  callee-save  registers  [16].  That  is,  the  first 
live  range  that  uses  a  callee-save  register  pays  the  callee-save  cost. 

I  compare  CBH  with  improved  Chaitin-style  coloring.  To  allow  an  easy  comparison 
of  the  new  figures  with  the  figures  in  Section  6.6, 1  use  the  same  base  line.  The  data  for 
improved  Chaitin-style  coloring  in  Figures  7.9,  7.10,  and  7.11  are  the  data  (SC+BS+PR) 
from  Section  6.6.  Again  the  y-axis  of  all  figures  shows  the  ratio  of  register  overhead 
operations  relative  to  the  register  overhead  produced  by  the  base  Chaitin-style.  The  bigger 
the  number  on  the  y-axis,  the  more  overhead  operations  are  removed. 

We  can  analyze  the  data  from  several  different  perspectives.  CBH  represents  a  model 
of  call  cost  that  is  conceptually  simple  (always  a  good  property).  CBH  does  not  allow  live 
ranges  that  cross  calls  to  use  caller-save  registers  because  they  interfere  with  all  caller-save 
registers.  Those  extra  interference  edges,  however,  may  overly  constrain  register  allocation, 
especially  when  there  is  not  a  sufficient  number  of  callee-save  registers  because  all  live 
ranges  crossing  calls  are  competing  for  callee-save  registers.  Examples  of  programs  that 
exhibit  this  behavior  are  alvinn,  compress,  ear,  espresso,  gcc,  li,  sc,  doduc, 
matrix300,  and  spice.  For  those  SPEC92  programs,  high  spill-cost  live  ranges  that 
cross  calls  are  spilled  because  there  are  only  a  small  number  of  callee-save  registers.  As  the 
number  of  callee-save  registers  increases,  the  constraints  become  less  critical,  and  CBH  is 
able  to  assign  registers  effectively  for  alvinn,  ear,  andmatrix300. 

Then  there  are  cases  where  improved  Chaitin  coloring  and  the  CBH  approach  pro- 
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duce  almost  identical  results,  i.e.  the  same  number  of  overhead  operations,  f  pppp  and 
tomcatv  fall  into  this  category. 

Sometimes  the  CBH  approach  does  not  improve  the  quality  of  register  allocation  using 
profile  information.  Examples  are  compress,  li,  sc,gcc,  doduc,nasa7  and  spice. 
For  those  programs,  we  see  that  CBH  cannot  catch  up  with  improved  Chaitin- style  coloring. 
The  main  reason  for  this  situation  is  that  many  live  ranges  on  the  most  frequently  executed 
paths  also  cross  call  sites  that  are  on  less  frequently  executed  paths.  All  those  live  ranges 
are  competing  for  callee-save  registers  regardless  the  fact  that  the  call  sites  contained  in  the 
live  range  are  not  frequently  executed.  Consequently,  those  live  ranges  are  either  assigned 
callee-save  registers  or  spilled.  As  a  result,  CBH  spills  more  than  necessary  and  incurs  high 
register  overhead  (spill  code)  for  such  programs.  On  the  other  hand,  improved  Chaitin- 
style  coloring  pays  caller-save  cost  at  occasionally  executed  paths  and  assigns  caller-save 
registers  to  live  ranges  that  do  not  cross  calls  on  frequently  executed  paths. 

Another  noteworthy  aspect  is  that  CBH  requires  a  fair  number  of  registers  to  make  up 
the  deficiency  of  its  call  cost  model  for  matrix300  and  nasa7.  The  data  for  improved 
Chaitin-style  coloring  and  CBH  of  matrix300  (dynamic)  show  that  improved  Chaitin- 
style  coloring  continues  improve  beyond  using  (9,7 ,3,3)  registers,  whereas  CBH  still  suffers 
from  the  lack  of  callee-save  registers  until  it  has  4  additional  integer  and  3  additional  floating¬ 
point  callee-save  registers.  nasa7  follows  a  similar  pattern.  For  ear  and  eqntott,  we 
see  a  similar  trend  although  it  is  less  pronounced.  However,  the  situation  is  not  as  severe  as 
for  matrix300  or  nasa7,  and  CBH  needs  only  a  few  more  registers  to  compensate  the 
limitation  of  its  call  cost  model. 

For  ear  and  nasa7,  improved  Chaitin-style  coloring  achieves  a  big  improvement  for 
both  the  static  and  dynamic  cases.  CBH,  however,  produces  roughly  identical  overhead 
operations  as  the  base  Chaitin-style  coloring  in  the  static  case.  From  those  figures,  we  can 
derive  the  conclusion  that  improved  Chaitin-style  coloring  is  superior  to  CBH  and  the  "base 
model"  is  actually  reasonable  after  all.  For  a  lot  of  register  ranges,  the  base  Chaitin-style 
allocator  is  actually  superior  to  the  CBH-style  allocator. 


Base-Chaitin/s 
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Figure 7.1 1:  Improved  Chaitin-style  versus  CBH  (Cont.), 


7.4  Summary 


In  this  chapter,  I  evaluate  the  effectiveness  of  optimistic  coloring.  The  results  indicate  that 
the  influence  of  optimistic  coloring  is  small  in  general.  In  addition,  optimistic  coloring 
needs  to  take  call  cost  into  account.  Otherwise,  the  overall  effect  may  be  negative. 

From  the  experimental  results,  improved  Chaitin-style  coloring  is  superior  to  priority- 
based  coloring.  Chaitin-style  coloring  is  able  to  pack  more  live  ranges  into  a  set  of  registers. 
The  enhancements  described  here  allow  Chaitin-style  allocation  to  order  live  ranges,  thereby 
achieving  simiiar  benefits  as  can  be  obtained  from  the  priority  function  of  priority-based 
coloring.  As  a  result,  low  spill-cost  live  ranges  do  not  take  away  the  registers  preferred  by 
high  spill-cost  live  ranges. 

Coloring  based  on  the  CBH  cost  model  imposes  the  constraint  of  not  assigning  caller- 
save  registers  to  live  ranges  containing  calls.  Due  to  this  constraint,  callee-save  registers 
then  become  the  most  critical  resources  in  a  lot  of  cases.  As  a  result,  the  caller-save 
registers  may  be  underutilized;  live  ranges  containing  calls  are  spilled  although  the  calls 
may  be  executed  only  infrequently. 

One  interesting  aspect  of  register  allocation  is  that  each  model  hits  some  aspects  of 
dealing  with  call  cost.  There  are  some  programs  (or  some  register  ranges  for  some  programs) 
where  the  priority-based  and/or  CBH-style  approaches  perform  reasonably  well,  as  there 
are  situations  where  optimistic  allocation  performs  well.  One  of  the  challenges  is  to  devise 
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an  approach  that  combines  the  benefits  of  all  models,  and  I  demonstrated  that  storage- 
class  analysis,  benefit-driven  simplification,  and  preference  decision  can  be  integrated  into 
Chaitin-style  coloring.  This  improved  Chaitin-style  register  allocator  provides  an  effective 
way  to  deal  with  modern  calling  conventions  for  a  wide  range  of  programs. 
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Chapter  8 
Extensions 


There  are  two  important  techniques  of  register  allocation  that  have  not  yet  been  covered 
so  far:  coalescing  and  rematerialization.  Both  techniques  improves  the  quality  of  register 
allocation.  Coalescing  tries  to  eliminate  unnecessary  copy  instructions.  Rematerialization 
replaces  spill  code  by  instructions  that  recompute  the  values.  It  is  important  to  see  how 
coalescing  and  rematerialization  interact  with  fusion-based  register  allocation.  There  are  a 
couple  of  interesting  questions  that  we  would  like  to  be  able  to  answer.  Can  fusion-based 
register  allocation  implement  rematerialization  (Section  8.1)?  Can  fusion-based  register 
allocation  improve  coalescing  quality  (Section  8.2)?  In  this  chapter,  we  discuss  each  of 
these  aspects  in  detail. 


8. 1  Rematerialization 

Most  modern  processors  require  multiple  cycles  to  access  the  cache,  i.e.,  a  memory  access 
needs  more  than  one  cycle;  on  the  other  hand,  several  values  can  often  be  recomputed  using 
a  single  instruction.  Rematerialization  recognizes  that  some  values  are  cheaper  (require 
fewer  cycles)  to  recompute  than  to  spill  to  memory  and  uses  the  instructions  that  recompute 
the  values  to  replace  the  spill  code.  Because  the  register  allocator  knows  the  value  of  a 
rematerializable  live  range,  to  spill  a  live  range,  it  is  not  necessary  to  insert  a  store  after 
each  definition  of  the  live  range.  Then  instead  of  inserting  a  load  before  each  use  of  the  live 
range,  register  allocation  places  there  the  instruction  that  recomputes  the  value.  [10]  lists 
opportunities  for  rematerialization: 
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•  immediate  loads  of  constants:  For  a  machine  that  has  the  immediate  addressing  mode, 
immpdiatp.  loads  of  integer  constants  can  be  accomplished,  e.g.,  using  li  on  MIPS, 
set  on  SPARC,  or  Idil  on  Alpha  processors. 

•  computing  effective  address:  The  constant  offset  of  the  stack  frame  can  be  computed 
using  add  with  immediate  addressing  mode,  e.g.  "add  srl,  fp,  64". 

•  loads:  The  value  of  a  load  from  a  known  constant  location  in  either  the  stack 
frame  or  the  static  data  area  can  be  reloaded  using  one  load  instruction,  e.g.,  "load 
srl ,  64  ( f  p ) ".  Although  the  load  instruction  is  as  expensive  as  the  spill  load,  we 
save/eliminate  spill  stores  that  write  the  value  to  the  spill  location. 

•  loading  non-local  frame  pointer  from  a  display. 


The  cost  of  spilling  a  rematerializable  live  range  ought  to  be  modeled  differently  from  a 
live  range  whose  value  can  not  be  rematerialized.  For  instance,  fusion-based  coloring  uses 
one  of  the  three  heuristics  in  [5]  to  choose  one  live  range  from  the  set  of  constrained  live 
ranges.  The  heuristic  is  where  cost{lr)  is  a  function  of  spilLcost{lr)  (the 

weighted  reference  count).  Assume  that  a  memory  access  takes  two  cycles.  The  cost(lr) 
is  redefined  as  follow: 


r  rematcrializationxost{lr))  if  Ir  is  rematerializable 
cost{lr)  -  I  spill _cost{lr)*2  Otherwise 

rcmatcrializationjcost{lr)  is  the  weighted  cycle  count  of  recomputing  the  value  of  Ir. 


(a)  (b)  (c) 


Figure  8.1:  Spilling  a  rematerializable  live  range. 
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Consider  the  two  live  ranges,  Ir^  and  Iry,  in  Figure  8.1(a).  Both  a;  and  y  are  defined  in 
Z?i  and  referenced  once  in  Dz.  Ir^  and  try  have  the  same  spilLcost.  The  numbers  next  to 
control-flow  edges  are  execution  frequencies.  Spilling  a;  or  y  to  memory  incurs  220  cycles. 
Figure  8.1(b)  shows  the  code  after  we  spill  y.  try  can  not  be  rematerialized  because  the 
cost  of  recomputing  "y=m/3"  is  expensive.  Ir^  can  be  rematerialized  because  recomputing 
address  a  can  be  done  using  one  instruction.  The  rematerialization.cost  of  is  100  cycles. 
Assume  that  the  register  allocator  requires  to  spill  either  Ir^  or  try.  The  register  allocator 
recognizes  that  rematerializing  Ir^  is  cheaper  than  spilling  Ivy  using  the  redefined  spilling 
heuristic  (depicted  in  Figure  8.1(c)). 


8.1.1  Prior  approaches 

Chaitin  et  al.  proposed  and  implemented  the  rematerialization  technique  in  the  IBM  PL.  8 
compiler  [13].  Briggs  et  al.  [10]  show  that  there  are  cases  where  the  value  of  a  live  range 
is  rematerializable  in  some  parts  of  the  live  range,  but  not  the  whole  live  range.  The 
rematerialization  technique  in  the  IBM  PL.8  compiler  was  not  integrated  with  live-range 
splitting.  Hence,  once  the  rematerializable  value  of  a  live  range  is  killed  within  the  live 
range,  they  give  up  potential  opportunities  of  rematerializing  some  parts  of  the  live  range 
and  spilling  the  rest. 

Briggs  et  al.  [10]  extend  Chaitin’s  approach  by  splitting  each  live  range  into  its  com¬ 
ponent  values  using  the  SSA  representation  [20]  so  that  each  live  range  has  exactly  one 
definition.  The  goal  is  to  isolate  rematerializable  values  so  as  to  expose  all  possible  oppor¬ 
tunities  of  rematerialization.  Wegman  and  Zadeck’s  sparse  simple  constant  algorithm  [62] 
is  modified  to  propagate  rematerialization  tags  to  each  value.  Because  splitting  each  live 
range  into  its  component  values  may  cause  excessive  shuffle  code  (moves),  conserva¬ 
tive  coalescing  and  biased  coloring  techniques  are  used  to  eliminate  unproductive  moves. 
Again  the  shortcoming  of  their  approach  is  that  splitting  decisions  are  made  prematurely. 
Conservative  coalescing  and  biased  coloring  techniques  do  not  guarantee  to  remove  all 
unproductive  moves.  In  addition,  splitting  points  may  be  on  frequently  executed  paths. 
Failing  to  eliminate  those  splitting  points  causes  a  lot  of  excessive  shuffle  code. 
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S.1.2  Rematerialization  in  fusion-style  coloring 

This  section  presents  how  rematerialization  can  be  implemented  in  fusion-style  register 
allocation.  There  are  3  phases  in  which  fusion-style  coloring  deals  with  rematerization: 
propagating  rematerialization  tags,  graph  construction,  and  shuffle  code  insertion.  A  new 
phase  is  inserted  before  the  graph  construction  phase  to  propagate  rematerialization  tags. 
Therefore,  when  constructing  a  live  range,  we  know  if  the  live  range  is  rematerializable  or 
not. 

Propagating  rematerialization  tags 

The  modified  Wegman  and  Zadeck’s  algorithm  used  in  [10]  does  not  fit  into  our  model 
because  the  actual  splitting  decisions  are  not  yet  made  and  live  ranges  are  not  yet  built.  We 
can  use  data-flow  analysis  to  propagate  rematerialization  tags.  The  data-flow  equations  are 
defined  as  follows: 

RcmIn{D)  =  |J  Rc7iiOut{J) 

Jerred{B) 

RcmOut{D)  =  {Rc7nIn(B)  •  -^Rc7nKill{D))  +  Rc7iiGcn{D) 

Rc7nGc7i{B)  =  \  inst:^  e  Z?  A  the  value  of  is  never  killed  in  B] 

Rc7i%Kill{B)  —  [inst^  \  dcf^  e  B] 

The  notation  mst^  represents  an  instruction  whose  target  is  virtual  register  x  and 
the  value  is  rematerializable,  e.g.  "add  x,  fp,  64".  €  B  indicates  that  the 

instruction  shows  up  in  the  schedule  of  block  B.  dcfx  €  B  denotes  that  there 
is  an  instruction  in  B  whose  target  is  virtual  register  x.  Hence,  RcmGc7i{B)  is  the  set 
of  all  instructions  that  appear  in  block  B  and  whose  values  are  rematerializable  and  not 
killed/modified.  Rc7iiKill{B)  is  the  set  of  all  instructions  whose  values  are  rematerializable 
and  targets  are  defined  in  B. 


Graph  construction 

Once  we  have  computed  the  data-flow  analysis  for  rematerialization,  we  can  set  the  rema¬ 
terialization  attribute  for  each  live  ranges.  The  attribute  for  a  live  range  that  does  not  span 
across  blocks  (local)  can  simply  be  determined  by  checking  if  the  value  of  the  live  range  is 
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rematerializable  or  not.  If  a  live  range  Iriix)  is  live  at  the  entry  of  a  block  Du  and  there 
is  only  one  definition  inst^  in  RcmIn{Di),  the  rematerializaUon  attribute  of  Iri(x)  is  set. 
Likewise,  for  a  live  range  lri{x)  that  is  defined  in  a  block  and  live  at  the  exit  of  a  block 
Bu  the  rematerialization  attribute  of  lri{x)  is  set  if  inst,  is  in  RcmOut{D,)  (there  can 
only  be  at  most  one  instx  from  B\  in  ReTnOut{D\)). 

The  rematerialization  attribute  is  propagated,  like  has-def,  during  the  graph  fusion 
phase.  When  two  live  ranges,  Iri  and  In  are  combined,  the  rematerialization  attribute  of 
the  combined  live  range  Irn  is  non-rematerializable  if  one  of  them  is  non-rematerializable. 

rcmaterialization(lri2)  =  rcmatcrialization{lr\)  A  rematcrialization{lr2) 


Shuffle  code  insertion 

On  an  edge  E  =  {Du  D2),  shuffle  code  is  inserted  for  a  virtual  register  a;  that  has  been  split 
on  E,  i.e.,  if  lri{x)  and  lr2ix)  have  not  been  assigned  the  same  storage  location.  Again 
we  use  the  notations  rl,  and  rl,  to  indicate  the  two  registers  that  are  assigned  to  ln{x) 
and  lr2{x)  respectively.  There  are  three  cases  to  consider  when  register  allocation  inserts 
shuffle  code  for  rematerializable  live  ranges. 

•  shuffle  store  (.lr2ix)  is  spilled):  A  shuffle  store  was  supposed  to  be  inserted  on  edge  E 
to  store  the  value  of  fri(a;)  as  shown  in  Figure  8.2(a).  If  lr2ix)  is  rematerialized,  then 
the  shuffle  store  is  not  necessary  and  removed  (as  depicted  in  Figure  8.2(d))  because 
lr2{x)  does  not  access  memory. 

•  shuffle  load  (Iri(x)  is  spilled):  A  shuffle  load  was  supposed  to  be  inserted  on  edge  E 
to  load  the  value  of  lr\{x)  as  shown  in  Figure  8.2(b).  If  lri{x)  is  rematerialized,  the 
shuffle  load  is  then  replaced  with  the  instruction  that  recomputes  the  value  of  Ir^ix) 
as  depicted  in  Figure  8,2(e). 

•  shuffle  move:  A  shuffle  move  was  supposed  to  be  inserted  on  edge  E  to  copy  the 
value  of  Iri(x)  to  lr2{x)  as  shown  in  Figure  8.2(c).  If  is  rematerializable  and 
the  definitions  of  Ir^ix)  can  be  eliminated,  the  shuffle  move  is  then  replaced  with  the 
instruction  that  recomputes  the  value  of  (ri  (s)  (as  depicted  in  Figure  8.2(f)).  Section 
8.1.3  lists  the  conditions  for  eliminating  definitions  of  a  rematerializable  live  range. 
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8.1.3  Optimization  of  rematerialization 

There  is  one  optimization  that  we  can  do  to  improve  the  result  of  rematerialization  by 
eliminating  the  instruction  inst^  of  a  live  range  .  Removing  inside  of  Ir^  must  meet  three 
conditions: 

•  Ivx  is  rematerializable. 

•  Ivx  contains  no  use  (only  the  definitions). 

•  Every  spilled  adjacent  live  range  of  that  uses  Ir^’s  value  is  rematerialized  (no 
shuffle  store  is  inserted). 

8.1.4  Examples  of  rematerialization 

The  example  of  Figure  8.3(a)  is  taken  from  [10].  Assume  that  we  use  loops  (or  blocks)  as 
regions.  Register  allocation  first  performs  the  data-flow  analysis  of  propagating  rematerial¬ 
ization  tags;  then  it  identifies  that  ln{p)  and  Ir^ip)  are  rematerializable,  but  not  lr-i{p)  (two 
distinct  definitions  reach  B3).  Consider  the  case  where  f?2  has  a  lot  of  register  pressure,  but 
Z?i  and  i?3  don’t.  The  register  allocator  chooses  lr2ip)  to  spill  and  rematerializes  it  during 
the  graph  simplification  phase.  After  fusing  graphs,  we  obtain  the  result  in  Figure  8.3(b). 
Live  ranges  of  p  are  split  on  <Di  ,D2>  and  <Z?2,1?3>.  The  shuffle  store  on  <i7i,f?2>  is  elim¬ 
inated;  the  shuffle  load  on  <D2,Di>  is  replaced  with  "p=Label"  (in  B')  which  recomputes 
the  value  of  p.  The  instmetion  "p=Label"  in  Bi  satisfies  all  three  conditions  and  is  removed. 
Consequently,  fusion-based  register  allocation  with  the  rematerialization  analysis  is  able  to 
obtain  the  result  in  Figure  8.3(c)  which  is  the  "ideal"  result  described  in  [10]. 

Moreover,  if  the  register  allocator  keeps  track  of  the  three  conditions,  the  register 
allocator  can  perform  the  optimization  and  removes  Ir^  while  fusing  graphs  rather  than 
after  fusing  all  graphs.  Removing  Irx  during  the  graph  fusion  phase  alleviates  register 
pressure  because  Ir^  does  not  occupy  one  register  resource.  As  a  result,  subsequent  graph 
fusion  could  benefit  from  the  optimization  (the  resulting  graph  contains  one  live  range  fewer 
than  otherwise). 

The  example  in  Figure  8.4(a)  shows  a  case  where  the  fusion-style  approach  provides  a 
better  framework  than  Briggs  et  al.  The  code  consists  of  two  loops.  The  number  next  to 


174 


CHAPTERS.  EXTENSIONS 


(a)  (b)  (c) 


Figure  8.3:  Example  of  rematerialization. 


a  control-flow  edge  is  execution  frequency.  We  notice  that  the  code  is  not  loop  intensive. 
On  average,  the  trip  count  of  the  first  loop  is  2;  the  trip  count  of  the  second  loop  is  4.  The 
shaded  region  is  the  most  frequently  executed  path.  Assume  that  the  code  in  B3  causes  high 
register  pressure,  and  register  allocation  needs  to  spill  one  live  range.  Briggs  et  al.  split  the 
live  range  of  p  on  <Ds,B6>  and  spill  p  to  memory  in  Z?i,  B2,  B3,  B4,  and  Bs.  The  splitting 
and  spilling  decisions  result  in  1 10  overhead  cycles  (100  cycles  of  shuffle  cost  on  <Bs,B6> 
and  10  cycles  for  "p=Label"  of  B3)  as  shown  in  Figure  8.4(b). 

If  fusion-style  coloring  chooses  loops  as  regions,  shuffle  code  on  <Bs,B6>  is  inevitable 
because  register  pressure  of  B3  causes  the  live  range  of  p  to  be  spilled  to  memory  in  the 
first  loop,  like  the  result  of  Figure  8.4(b).  Hence,  there  are  100  overhead  cycles  (100  for 
shuffle  code,  10  for  "p=Label"  of  B3,  and -10  for  eliminating  "p=Label"  of  B\).  However, 
if  fusion-style  coloring  chooses  blocks  as  regions,  the  graph-fusion  framework  is  able  to 
identify  and  isolate  the  troublesome  block  B3,  and  therefore  fusion-style  coloring  spills  p 
only  in  B3.  Consequently,  there  are  only  20  overhead  cycles  (10  for  "p=Label"  of  B3  and 
10  for  shuffle  cost  on  <B3,Bs>  as  shown  in  Figure  8.4(c)). 
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(a)  (b)  (c) 

Figure  8.4:  Example  of  shuffle  code. 


8.2  Coalescing 

During  compilation,  a  compiler  may  generate  unnecessary  copy/move  instructions.  The 
compiler  can  eliminate  a  copy  instruction  that  moves  a  data  value  between  two  non¬ 
conflicting  live  ranges  by  coalescing  the  two  live  ranges  into  one.  In  general,  copy/move 
instructions  originate  from  4  different  sources: 

•  users'  code:  Users  may  write  an  assignment  ”x  =  y"  which  copies  y’s  value  to  x. 
Global  optimization  eliminates  the  copy  assignment  by  performing  copy  propagation 
and  dead  code  elimination.  Copy  propagation  first  propagates  y  to  every  use  of  x  if 
possible.  Then  dead  code  elimination  eliminates  the  assignment  if  all  uses  of  x  are 
replaced  by  y.  The  assignment  remains  if  not  all  uses  of  x  are  replaced  by  y. 

•  global  optimization:  Global  optimization  may  produce  copy  assignments.  For  in¬ 
stance,  the  ”a+b‘'  expression  of  statement  "x  =  a+b“  may  be  a  common  subexpres¬ 
sion.  The  statement  is  then  replaced  by  a  copy  assignment  "x  =  tmp"  where  imp 
is  a  compiler  generated  temporary  to  hold  the  value  a+b. 

•  calling  convention:  The  protocol  of  passing  parameters  and  function  return  values 
between  a  caller  and  callee  is  called  calling  convention.  That  is,  with  a  given  calling 
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convention,  a  caller  knows  where  to  save  outgoing  parameters  and  retrieve  function 
return  values.  A  callee  knows  where  to  retrieve  incoming  parameters  and  save  its 
return  value.  Most  modern  processors  have  a  few  registers  reserved  for  the  calling 
convention  (argument  and  function  return  registers).  If  an  outgoing  or  incoming 
parameter  resides  in  a  register,  moving  the  outgoing  parameter  to  an  argument  register 
or  moving  an  argument  register  to  the  incoming  parameter  is  a  copy  instruction. 
Likewise,  passing  function  return  values  may  also  require  copy  instructions. 

•  live-range  splitting:  If  register  allocation  splits  live  ranges,  shuffle  moves  may  be 
needed  to  move  values  at  splitting  points. 

When  two  live  ranges,  Iri  and  lr2,  are  coalesced  into  one,  the  two  sets  of  interference 
edges,  '.p{lri)  and  '.pilri),  are  combined  as  well.  Apparently,  the  interference  gr^h  after 
coalescing  may  have  different  register  pressure.  Coalescing  may  reduce  register  pressure 
as  well  as  increase  register  pressure. 


Figure  8.5:  Coalescing  increases  register  pressure  (N=3). 


Figure  8.5  shows  a  case  where  coalescing  two  live  ranges  increases  register  pressure. 
With  N=3,  we  can  simplify  the  interference  graph  on  the  left  hand  side  by  first  removing 
In  and  /ra.  Then  the  remaining  three  live  ranges  become  unconstrained  and  are  removed 
from  the  graph.  The  graph  can  be  colored  using  3  colors.  Coalescing  Iri  and  Iri  produces 
a  clique  of  size  4,  which  cannot  be  colored  using  3  colors. 

Figure  8.6  shows  a  case  where  coalescing  two  live  ranges  decreases  register  pressure. 
There  is  a  3-color  solution  for  the  interference  graph  (red  color  for  Ir^  and  Ir^,  green  color 
for  Iry,  and  blue  color  for  Iri,  Iri,  Ivt,  and  Ir^).  Before  coalescing,  the  graph  can  not 
be  simplified  because  simplification  blocks  as  soon  as  is  removed  from  the  graph  (all 
remaining  nodes’  degrees  ^  3).  Thus,  register  allocators  that  rely  on  simplification  to 
determine  the  color  ordering  fail  to  color  the  graph  with  N=3.  Optimistic  coloring  does 
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Figure  8.6:  Coalescing  decreases  register  pressure  (N=3). 


not  guarantee  to  find  a  3-color  solution.  For  example,  simplification  of  the  graph  before 
coalescing  blocks  as  soon  as  Irt  is  removed  from  the  graph.  Optimistic  coloring  may  decide 
to  defer  spilling  Ir^  and  remove  it  from  the  graph  and  tries  to  find  an  available  color  for 
/ra;  later.  One  legitimate  ordering  of  simplifying  the  remaining  graph  is  Iru  Ir^  Iry, 
and  Ir^^.  When  assigning  colors  to  live  ranges  in  the  reverse  order,  optimistic  coloring  may 
assign  the  red  color  to  and  Ivy,  the  green  color  to  /ra,  and  the  blue  color  to  Irt,  Ir-w, 
and  Iri,  As  a  result,  there  is  no  available  color  for  Ir^,  After  coalescing  Iri  and  /ra,  3 
duplicate  edges,  [/ria,/r^.],  [IrnMyl  and  [/ria,/ra:]  are  eliminated,  and  a  graph  G  which  is 
less  constrained  than  the  original  graph  is  obtained.  The  graph  G  can  be  simplified  with 
N=3  by  removing  nodes  in  the  order  of  Irt,  Ir..,  Irn,  Ir^,,  Ir^,  and  Wy, 

8.2.1  Prior  approaches 

Chaitin  et  al.  eliminate  a  move  instruction  when  the  live  ranges  of  the  source  and  destination 
do  not  conflict  [13].  After  all  possible  desirable  copy  instructions  are  eliminated,  the  entire 
interference  graph  is  rebuilt  from  scratch.  Then  Chaitin  et  al.  rerun  the  coalescing  module 
because  some  copy  instructions  that  at  first  were  impossible  to  be  eliminated  may  now  be 
redundant.  The  coalescing  module  is  reiterated  until  no  more  copy  instruction  is  eliminated. 
The  problem  with  the  approach  is  that  coalescing  live  ranges  may  increase  register  pressure 
as  shown  in  Figure  8.5.  Therefore,  coalescing  may  render  the  final  resulting  graph  more 
constrained,  and  register  allocation  results  in  more  spill  code  than  without  coalescing. 

Briggs  uses  coalescing,  conservative  coalescing,  and  biased  coloring  to  eliminate  copy 
instructions  [7].  All  possible  copy  instructions  except  shuffle  moves  are  coalesced  using 
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the  approach  of  Chaitin  et  al.  The  main  goal  of  splitting  live  ranges  is  reducing  register 
pressure.  Shuffle  moves  are  coalesced  conservatively  so  as  to  prevent  introducing  extra  spill 
code.  The  conservative  coalescing  approach  coalesces  two  live  ranges  under  the  constraint 
that  the  resulting  live  range  has  less  than  N  neighbors  whose  degree  >  N.  The  constraint 
makes  sure  that  the  resulting  live  range  is  not  spilled.  Biased  coloring  is  a  final  attempt  to 
clean  up  unproductive  shuffle  moves  by  assigning  the  same  colors  to  live  ranges  that  the 
moves  connect.  Conservative  coalescing  may  suppress  coalescing  two  live  ranges  under 
the  condition  that  the  resulting  live  range  may  be  spilled.  In  fact,  the  resulting  live  range 
is  not  spilled.  Namely,  conservative  coalescing  is  too  conservative.  Consider  the  example 
of  Figure  8.6.  After  we  coalesce  In  and  In  into  Im,  we  can  simplify  the  resulting  without 
blocking.  Conservative  coalescing,  on  the  other  hand,  believes  that  Irn  will  be  spilled 
because  there  are  3  neighbors,  /r*,  Zr„,  and  ir.,  whose  degrees  >  3. 

George  and  Appel  take  an  iterated  coalescing  approach  that  integrates  coalescing  into 
simplification  [28].  Live  ranges  are  categorized  as  either  move-related  or  non-move-related. 
A  move-related  live  range  copies  a  value  to/from  other  live  ranges.  Simplification  removes 
only  non-move-related  live  ranges.  Simplification  blocks  when  the  residual  graph  consists 
of  only  move-related  live  ranges  or  non-move-related  ones  whose  degrees  >  N.  George  and 
Appel  then  perform  conservative  coalescing  on  the  residual  graph.  The  iterated  coalescing 
approach  is  likely  to  coalesce  more  live  ranges  than  the  conservative  approach  because  the 
residual  graph  is  less  constrained  than  the  entire  graph.  A  move-related  live  range  may 
become  non-move-related  after  coalescing.  The  iterated  coalescing  approach  reiterates 
from  simplification  to  remove  non-move-related  live  ranges.  If  no  live  ranges  can  be 
coalesced,  and  all  non-move-related  live  ranges  have  degrees  >  N,  then  a  move-related  live 
range  of  low  degree  is  chosen  to  be  considered  as  non-move-related.  That  is,  the  iterated 
approach  gives  up  the  hope  of  coalescing  the  live  range  with  others  and  then  reiterates  from 
simplification  again. 

8.2.2  Coalescing  in  fusion-style  coloring 

Presently,  our  register  allocator  uses  different  approaches  to  eliminate  copy  instructions  for 
Chaitin-style  and  fusion-style  coloring.  For  Chaitin-style  coloring,  we  coalesce  two  live 
ranges  connected  by  a  move  if  they  don’t  interfere.  Fusion-style  coloring  splits  live  ranges 
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only  when  necessary  to  reduce  spill  code.  Hence,  coalescing  live  ranges  at  splitting  points 
is  unnecessary  because  coalescing  them  increases  spill  code  (undoes  splitting  decisions). 
Therefore,  fusion-style  coloring  uses  biased  coloring  to  eliminate  copy  instructions. 

Copy  instructions  originating  from  users’  code  and  global  optimization  are  actually 
artificial  splitting  points  and  may  be  on  frequently  executed  paths.  Using  biased  coloring  to 
eliminate  these  two  kinds  of  copy  instructions  may  not  be  ideal  because  biased  coloring  may 
fail  to  eliminate  a  frequently  executed  copy  instruction.  Fusion-style  coloring  is  sensitive 
to  edge  ordering  and  therefore  able  to  handle  high  register  pressure  well  by  splitting  live 
range  at  infrequently  executed  edges.  These  nice  properties  allow  us  to  have  a  different 
viewpoint  on  coalescing  the  two  kinds  of  copy  instructions.  We  can  aggressively  coalesce 
live  ranges  without  worrying  much  about  the  incurred  register  pressure  because  fusion-style 
coloring  is  able  to  alleviate  the  pressure  by  spilling  transparent  or  splitting  live  ranges.  One 
feasible  implementation  of  this  aggressive  coalescing  approach  is  performing  Chaitin-style 
coalescing  prior  to  fusion-style  coloring  as  shown  in  Figure  8.7.  Chaitin-style  coalescing 
contains  two  steps,  graph  construction  and  aggressive  coalescing.  The  graph  construction 
step  builds  the  function- wide  graph  G f  unction-  If  live  ranges  are  coalesced  in  the  second 
step,  then  G /unction  is  thrown  away,  and  the  two  steps  are  repeated  until  no  more  live  ranges 
can  be  coalesced. 


Figure  8.7:  Structure  of  coalescing. 


8.3  Summary 

In  this  chapter,  I  present  how  to  integrate  rematerialization  and  coalescing  with  fusion- 
style  coloring.  Rematerialization  and  fusion-style  coloring  cooperate  with  each  other 
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seamlessly.  Fusion-style  coloring  is  able  to  isolate  high  register-pressure  regions  so  as 
to  avoid  rematerialization  happening  on  frequent  executed  paths.  Rematerization  helps 
fusion-style  coloring  by  pruning  useless  rematerializable  live  ranges  out  of  the  interference 
graph  (alleviates  register  pressure).  Because  of  the  ability  of  handling  high  register  pressure 
in  fusion-style  coloring,  it  is  feasible  to  eliminate  frequently  executed  copy  instructions  by 
spilling  transparent  live  ranges  or  splitting  live  ranges  at  infrequently  executed  places. 


Chapter  9 
Conclusions 


9.1  Summary 


This  dissertation  has  proposed  a  novel  register-allocation  approach  which  is  fusion-based. 
Fusion-style  register  allocation  starts  off  with  constructing  regions  and  applies  graph  fusion 
along  control-flow  edges  to  combine  the  interference  graphs  of  regions  into  the  interference 
graph  for  the  whole  function.  Graph  fusion  integrates  spilling,  splitting,  and  color  binding 
in  a  seamless  fashion.  The  delayed  spilling  technique  avoids  making  premature  spilling 
decisions.  Maintaining  the  colorability  invariant  splits  live  ranges  only  when  necessary.  No 
color  binding  decisions  are  ever  made  during  graph  fusion.  More  importantly,  fusion-style 
coloring  is  sensitive  to  the  ordering  of  control-flow  edges  that  connect  regions.  Splitting 
is  unlikely  to  happen  at  high  priority  edges  (frequently  executed)  and  likely  to  happen 
at  low  priority  edges  (infrequently  executed).  As  to  fusion-style  coloring,  the  register- 
allocation  problem  becomes  a  problem  of  determining  the  edge  ordering  and  selecting 
regions.  Fusion-style  coloring  provides  a  nice  framework  for  register  allocation  which 
models  various  live-range  splitting  approaches  such  as  the  Tera,  SSA,  PDG,  and  Multiflow 
approaches. 

This  dissertation  has  presented  how  rematerialization  and  coalescing  can  be  integrated 
with  fusion-style  coloring.  Fusion-style  coloring  is  able  to  leverage  rematerialization  by 
isolating  the  high  register  pressure  region  and  avoiding  inserting  rematerialization  code  into 
frequently  executed  paths.  Rematerialization  is  also  able  to  leverage  fusion-style  coloring 
by  pruning  unnecessary  rematerializable  live  ranges  during  graph  fusion  to  alleviate  register 
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pressure  so  as  to  benefit  subsequent  graph  fusion  by  lowering  register  pressure.  Fusion-style 
coloring  allows  register  allocation  to  split  live  ranges  at  infrequently  executed  edges  or  spill 
low-cost  live  ranges  to  trade  off  against  frequently  executed  copy  instructions. 

It  is  very  important  to  look  at  register  allocation  from  the  compiler’s  perspective  because 
the  compiler  is  more  interested  in  how  the  register  allocator  interacts  with  the  rest  of 
the  compiler.  This  dissertation  has  shown  that  fusion-style  register  allocation  provides  a 
framework  that  is  well-prepared  for  region-based  compilation.  Fusion-style  coloring  is 
able  to  deal  with  the  register  pressure  caused  by  scope  expanding  transformations  such  as 
unrolling  and  inlining,  whereas  the  traditional  Chaitin-style  coloring  breaks  down.  In  other 
words,  fusion-style  coloring  provides  a  solution  to  region-based  compilation  so  that  register 
allocation  is  no  longer  an  obstacle  to  region-based  compilation. 

This  dissertation  has  presented  and  evaluated  several  enhancements  to  register  alloca¬ 
tion.  Chaitin-style  register  allocation  with  these  enhancements  integrates  both  Chaitin-style 
and  priority-based  coloring  into  one  so  that  one’s  strengths  compensates  the  other’s  weak¬ 
nesses.  The  experimental  results  have  shown  that  Chaitin-style  register  allocation  with  the 
enhancements  actually  outperforms  some  existing  approaches  that  have  been  adopted  by 
researchers. 

9.2  Future  work 

Even  though  this  dissertation  has  demonstrated  that  fusion-style  coloring  is  superior  to 
other  register-allocation  approaches,  there  still  remain  some  future  work  to  be  done.  Two 
important  areas  require  investigations  and  evaluations: 

First,  integrating  fusion-style  coloring  into  interprocedural  register  allocation  is  possible 
and  may  provide  pleasing  results.  Consider  a  call  graph  with  3  functions,  /,  g,  and  h,  f 
calls  g\  g  calls  h.  Assume  that  h  is  compiled  and  requires  2  callee-save  registers,  pn  and 
pr2>  Interprocedural  register  allocation  can  treat  pri  and  prz  as  caller-save  register  instead 
of  callee-save.  If  g  uses  the  two  registers,  then  save/restore  code  are  have  to  be  inserted 
around  the  call  site  h().  However,  the  call  site  h{)  may  be  frequently  executed.  Because 
fusion-style  coloring  is  edge-ordering  sensitive,  it  provides  a  nice  framework  to  deploy 
register  save/restore  code  on  less  frequently  executed  paths.  The  places  of  saving/restoring 
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pri  and  prz  can  be  moved  further  away  from  the  call  site  h()  as  graphs  are  fused. 

Second,  another  important  area  is  to  integrate  fusion-style  coloring  with  code  scheduling. 
If  there  are  more  than  N  non-transparent  live  ranges  simultaneous  live  at  one  point,  we 
know  that  the  register  allocator  must  spill  some  live  ranges  to  lower  the  register  pressure.  If 
the  code  scheduler  is  sensitive  to  register  allocation  in  the  first  place,  then  register  allocation 
can  avoid  spill  and  shuffle  code.  For  instance,  when  scheduling  code,  the  scheduler  makes 
sure  that  there  are  no  more  than  iV  simultaneously  live  non-transparent  live  ranges  at  any 
given  point.  Whole  or  partial  live  ranges  are  allowed  to  be  spilled  to  satisfy  the  condition 
(luring  code  scheduling.  Because  spill  code  is  produced  during  code  scheduling,  the  spill 
code  can  be  scheduled  right  away  instead  of  after  register  allocation. 
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